Abstract

In this article, we derive concentration inequalities for the cross-validation estimate of the
generalization error for stable predictors in the context of risk assessment. The notion of stability
was first introduced by [DEWA79] and extended by [KEA95], [BE01] and [KUNIY02] to
characterize classes of predictors with infinite VC dimension. In particular, this covers k-nearest
neighbors rules, Bayesian algorithms ([KEA95]), boosting, etc. General loss functions and classes of
predictors are considered. We use the formalism introduced by [DUD03] to cover a large variety of
cross-validation procedures including leave-one-out cross-validation, k-fold cross-validation,
hold-out cross-validation (or split sample), and leave-v-out cross-validation.

In particular, we give a simple rule on how to choose the cross-validation, depending on the stabil- 
ity of the class of predictors. In the special case of uniform stability, an interesting consequence is 
that the number of elements in the test set is not required to grow to infinity for the consistency 
of the cross-validation procedure. In this special case, the particular interest of leave-one-out 
cross-validation is emphasized. 

Keywords: Cross-validation, stability, generalization error, concentration inequality, optimal split- 
ting, resampling. 
1 Introduction and motivation



One of the main issues of pattern recognition is to create a predictor (a regressor or a classifier) which
takes observable inputs in order to predict the unknown nature of an output. Formally, a predictor $\phi$ is
a measurable map from some measurable space $\mathcal{X}$ to some measurable space $\mathcal{Y}$. When
$\mathcal{Y}$ is a countable set (respectively $\mathbb{R}^m$), the predictor is called a classifier
(respectively a regressor). The strategy of Machine Learning consists in building a learning algorithm
$\Phi$ from both a set of examples and a class of methods. Typical classes of methods are empirical risk
minimization or k-nearest neighbors rules. The set of examples consists of the measurement of n
observations $(x_i, y_i)_{1 \le i \le n}$. Thus, formally, $\Phi$ is a measurable map from
$\mathcal{X} \times \bigcup_n (\mathcal{X} \times \mathcal{Y})^n$ to $\mathcal{Y}$. One of the main issues of
Statistical Learning is to analyze the performance of a learning algorithm in a probabilistic setting:
$(x_i, y_i)_{1 \le i \le n}$ are supposed to be observations of n independent and identically distributed
(i.i.d.) random variables $(X_i, Y_i)_{1 \le i \le n}$ with unknown distribution P.
$(X_i, Y_i)_{1 \le i \le n}$ is denoted $\mathcal{D}_n$ in the following and called the learning set. In
order to analyze the performance, it is usual to consider the conditional risk of a learning algorithm
$\Phi$, denoted $R_n$ and also called the generalization error. It is defined by the conditional
expectation of $L(Y, \Phi(X, \mathcal{D}_n))$ given $\mathcal{D}_n$, where $(X, Y) \sim P$ is a random
variable independent of $\mathcal{D}_n$, i.e.
$R_n := E_{X,Y}(L(Y, \Phi(X, \mathcal{D}_n)) \mid \mathcal{D}_n)$ with L a cost function from
$\mathcal{Y}^2 \to \mathbb{R}_+$. Notice that $R_n$ is a random variable measurable with respect to
$\mathcal{D}_n$.

An important question is: the distribution P of the generating process being unknown, can we estimate
how good a predictor trained on a learning set of size n is? In other words, can we approximate the
generalization error $R_n$? This fundamental statistical problem is referred to as "choice and assessment
of statistical predictions" [STO74]. Many estimates have been proposed. Quoting [HTF01]: "Probably
the simplest and most widely used method for estimating prediction error is cross-validation."

The cross-validation procedures include leave-one-out cross-validation, k-fold cross-validation,
hold-out cross-validation (or split sample), and leave-v-out cross-validation (or Monte Carlo
cross-validation or bootstrap cross-validation). With the exception of [BUR89], theoretical
investigations of multifold cross-validation procedures first concentrated on linear models ([LI87];
[SHAO93]; [ZHA93]). Results of [DGL96] and [GYO02] are discussed in Section 3. The first finite sample
results are due to Wagner and Devroye [DEWA79] and concern k-local rules algorithms under leave-one-out
and hold-out cross-validation. More recently, [HOL96] and [HOL96bis] derived finite sample results for
v-out cross-validation, k-fold cross-validation, and leave-one-out cross-validation for Empirical Risk
Minimization (ERM) over a class of predictors with finite Vapnik-Chervonenkis dimension (VC-dimension)
in the realisable case (the generalization error is equal to zero). [BKL99] have emphasized when k-fold
can beat v-out cross-validation in the particular case of the k-fold predictor. [KR99] has extended such
results to the case of stable algorithms for the leave-one-out cross-validation procedure. [KEA95] also
derived results for hold-out cross-validation for ERM, but their arguments rely on the traditional notion
of VC-dimension. In the particular case of ERM over a class of predictors with finite VC-dimension
but with general cross-validation procedures, we derived probability upper bounds in chapter 1:
we denote by $p_n$ the percentage of elements in the test sample. In the sequel, we will denote by
$R_{CV}$ the cross-validation estimator. For empirical risk minimizers over a class of predictors with
finite VC-dimension $V_C$, to be defined below, we obtained the following concentration inequality. For
all $\varepsilon > 0$, we have

$$\Pr(|R_{CV} - R_n| \ge \varepsilon) \le B(n, p_n, \varepsilon) + V(n, p_n, \varepsilon),$$

with 

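• $B(n, p_n, \varepsilon) = 5\,(2n(1-p_n)+1)^{\frac{4V_C}{1-p_n}} \exp\left(-\frac{n\varepsilon^2}{64}\right)$,

• $V(n, p_n, \varepsilon) = \min\left(\exp\left(-\frac{2np_n\varepsilon^2}{25}\right),\ \frac{16}{\varepsilon}\sqrt{\frac{V_C(\ln(2(1-p_n)+1)+4)}{n(1-p_n)}}\right)$.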
Unfortunately, many popular predictors, including k-nearest neighbors rules, do not satisfy this
finite VC-dimension property. Moreover, the bounds obtained are called "sanity check bounds" since they
are not better than classical Vapnik-Chervonenkis bounds.

To avoid the traditional analysis in the VC framework, notions of stability have been intensively
worked through in the late 90's ([KEA95], [BE01], [BE02], [KUT02], and [KUNIY02]). The object
of the stability framework is the learning algorithm rather than the space of classifiers. The learning
algorithm is a map (effective procedure) from data sets to classifiers. An algorithm is stable at a
learning set $\mathcal{D}_n$ if changing one point in $\mathcal{D}_n$ yields only a small change in the
output hypothesis. The attraction of such an approach is that it avoids the traditional notion of
VC-dimension, and allows us to focus on a wider class of learning algorithms than empirical risk
minimization. For example, this approach provides generalization error bounds for regularization-based
learning algorithms that have been difficult to analyze within the VC framework, such as boosting. As a
motivation, we quote the following list of algorithms satisfying stability properties: regularization
networks, ERM, k-nearest rules, boosting.

Algorithmic stability was first introduced by [DEWA79]. [?] argued that unstable weak learners benefit
from randomization algorithms such as bagging. [KR99] considered both algorithmic stability and the
weaker related notion of error stability. They proved bounds on the error of cross-validation estimates
of generalization error, but their arguments rely on VC theory. [BE01] and [BE02] proved that an
algorithm which is stable everywhere has low generalization error; their proof does not make any
reference to VC-dimension. They showed that regularization networks are stable. In [KUNIY02], at least
ten different notions were examined. In particular, they introduced a probabilistic notion of change-one
stability called Cross-Validation stability or CV stability. This was shown to be necessary and
sufficient for consistency of ERM in the Probably Approximately Correct (PAC) Model of [VAL84].

The goal of this paper is to obtain exponential bounds to fill Table 1 where possible bounds are
missing (to the best of our knowledge).



                               leave-one-out            hold-out          k-fold            v-out

ERM with finite VC-dimension   Kearns, Holden, Cornec   Holden, Cornec    Holden, Cornec    Cornec
hypothesis stability           Devroye and W.           Devroye and W.    X                 X
error stability with
finite VC dimension            Kearns                   Kearns            X                 X
uniform stability              Bousquet and E.          X                 X                 X
strong hypothesis              Kutin and N.             X                 X                 X
weak stability                 X                        X                 X                 X

Table 1: Missing bounds X to find

The goal of this article is also to show that cross-validation is still consistent for stable predictors. As
a consequence, we will emphasize the role played by cross-validation: it can be a consistent estimate
of the generalisation error when the training error, defined by
$\hat{R}_n := \frac{1}{n}\sum_{i=1}^{n} L(Y_i, \phi(X_i, \mathcal{D}_n))$, is not.
Indeed, for stable predictors, the training error can be arbitrarily poor: for example, the training error
of the 1-nearest neighbor rule is equal to zero whatever the generalisation error may be.
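The 1-nearest neighbor remark can be checked numerically. The following is a minimal sketch, assuming one-dimensional inputs, binary labels drawn independently of the inputs (so the true error is 1/2), and the 0-1 loss; the function names are illustrative and do not come from the paper.

```python
import random

def one_nn_predict(x, train):
    """Return the label of the training point closest to x (1-nearest neighbor)."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def training_error(train):
    """Resubstitution error: each point is predicted from a set that contains it."""
    return sum(one_nn_predict(x, train) != y for x, y in train) / len(train)

def leave_one_out_error(train):
    """Leave-one-out error: each point is predicted from the other n - 1 points."""
    mistakes = 0
    for i, (x, y) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        mistakes += one_nn_predict(x, rest) != y
    return mistakes / len(train)

rng = random.Random(0)
# Labels are pure noise, independent of the inputs, so the true error is 1/2.
sample = [(rng.random(), rng.randint(0, 1)) for _ in range(200)]
train_err = training_error(sample)
loo_err = leave_one_out_error(sample)
print(train_err, loo_err)
```

In this toy setting the training error is exactly zero because every point is its own nearest neighbor, while the leave-one-out estimate stays close to the true error of 1/2.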

We introduce our main results¹. Suppose that the cross-validation is symmetric, i.e. the probability
for an observation to be in the training set is independent of its index, and that the number of elements
in the test set is constant and equal to $np_n$, with $p_n$ the percentage of elements in the test set.
All the bounds are of the following form:
$\Pr(|R_{CV} - R_n| \ge \varepsilon + \cdots) \le B(n, p_n, \varepsilon) + V(n, p_n, \varepsilon)$.
Under certain stability conditions, satisfied for example by Empirical Risk Minimisers (ERM) or
Adaboost, we have for all $\varepsilon > 0$,

$$\Pr(|R_{CV} - R_n| \ge \varepsilon + 2\lambda p_n) \le 2\exp(-2np_n\varepsilon^2) + \delta_{n,p_n}$$

¹ Precise inequalities can be found in Section 3.
with $\delta_{n,p_n}$ and $\lambda$ non-negative real numbers. For classical algorithms, we have in mind
that $\delta_{n,p_n} = O_n(p_n \exp(-n(1-p_n)))$. $\lambda$ is in fact a Lipschitz coefficient with respect
to the total variation and can be interpreted as a stability factor: the smaller $\lambda$ is, the more
stable the learning algorithm is. Furthermore, if the learning algorithm satisfies a stronger stability
condition (for example Adaboost or regularization networks), we obtain

Pv(\Rcv-Rn\>e + S n , Pn +2Xp n )<4(eM- s{mx)2np2 ) + -^< P J) 

with $\delta'_{n,p_n} = \delta_{n,p_n} + (n+1)\,\delta_{n+1,1/(n+1)}$. For the latter, it is thus not required
that the number of elements in the test set grows to infinity for the consistency of the cross-validation
to hold.

Using these probability bounds, we can then deduce that the expected distance between the generalization
error and the cross-validation error, $E_{\mathcal{D}_n}|R_{CV} - R_n|$, is of order
$O_n((\lambda/n)^{1/3})$. As far as the expectation $E_{\mathcal{D}_n}|R_{CV} - R_n|$ is concerned, we can
define a splitting rule in the general setting: the percentage of elements $p_n^*$ in the test set should
be proportional to $(1/\lambda n)^{1/3}$, i.e. the less stable (i.e. $\lambda$ large) the learning
algorithm is, the smaller the test set in the cross-validation should be. Furthermore, if the learning
algorithm satisfies a stronger stability condition (for example Adaboost or regularization networks), we
also have $E_{\mathcal{D}_n}|R_{CV} - R_n| = O_n(\lambda/\sqrt{n})$ and the leave-one-out cross-validation
(i.e. $p_n^* = 1/n$) is preferred for n large enough.

The paper is organized as follows. In the next section, we recall the main notations and definitions of 
cross-validation as introduced in chapter 1. We also introduce notations to unify the main notions of 
stability. Finally, in Section 3, we introduce our results in terms of probability upper bounds. We also
prove that many traditional methods satisfy our generalized notion of stability (lasso, ..., Adaboost,
k-nearest neighbors).

2 Notations and definitions 

In the following, we follow the notations of cross-validation introduced in chapter 1. 
2.1 Cross-validation 

We will consider the following shorter notations inspired by the literature on empirical processes. In
the sequel, we will denote $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$, and
$(Z_i)_{1 \le i \le n} := ((X_i, Y_i))_{1 \le i \le n}$ the learning set. For a given loss function L and
a given class of predictors $\mathcal{G}$, we define a new class $\mathcal{F}$ of functions from
$\mathcal{Z}$ to $\mathbb{R}_+$ by
$\mathcal{F} := \{\psi \in \mathbb{R}_+^{\mathcal{Z}} \mid \psi(Z) = L(Y, \phi(X)), \phi \in \mathcal{G}\}$.
For a learning algorithm $\Phi$ we have the natural definition
$\Psi(Z, \mathcal{D}_n) := L(Y, \Phi(X, \mathcal{D}_n))$. With these notations, the conditional risk $R_n$
is the expectation of $\Psi(Z, \mathcal{D}_n)$ with respect to P conditionally on $\mathcal{D}_n$:
$R_n := E_Z[\Psi(Z, \mathcal{D}_n) \mid \mathcal{D}_n]$ with $Z \sim P$ independent of $\mathcal{D}_n$. In
the following, if there is no ambiguity, we will also allow the notation $\psi(X, \mathcal{D}_n)$ instead
of $\Psi(X, \mathcal{D}_n)$.

To define the exact type of cross-validation procedure, we introduce binary vectors. Let
$V_n = (V_{n,i})_{1 \le i \le n}$ be a vector of size n. $V_n$ is a binary vector if for all
$1 \le i \le n$, $V_{n,i} \in \{0, 1\}$ and $\sum_{i=1}^{n} V_{n,i} \neq 0$. Consequently, we can define the
subsample associated with it, $\mathcal{D}_{V_n} := \{Z_i \in \mathcal{D}_n \mid V_{n,i} = 1, 1 \le i \le n\}$.
We define a weighted empirical measure on $\mathcal{Z}$

$$P_{n,V_n} := \frac{1}{\sum_{i=1}^{n} V_{n,i}} \sum_{i=1}^{n} V_{n,i}\,\delta_{Z_i},$$

with $\delta_{Z_i}$ the Dirac measure at $\{Z_i\}$. We also define a weighted empirical error
$P_{n,V_n}\psi$, where $P_{n,V_n}\psi$ stands for the usual notation of the expectation of $\psi$ with
respect to $P_{n,V_n}$. For $P_{n,1_n}$, with $1_n$ the binary vector of size n with 1 at every coordinate,
we will use the traditional notation $P_n$. For a predictor trained on a subsample, we define

$$\psi_{V_n}(\cdot) := \Psi(\cdot, \mathcal{D}_{V_n}).$$

With the previous notations, notice that the predictor trained on the learning set
$\psi(\cdot, \mathcal{D}_n)$ can be denoted by $\psi_{1_n}(\cdot)$. We will allow the simpler notation
$\psi_n(\cdot)$. The learning set is divided into two disjoint sets: the training set of size $n(1-p_n)$
and the test set of size $np_n$, where $p_n$ is the percentage of elements in the test set. To represent
the training set, we define $V_n^{tr}$, a random binary vector of size n independent of $\mathcal{D}_n$.
$V_n^{tr}$ is called the training vector. We define the test vector by $V_n^{ts} := 1_n - V_n^{tr}$
to represent the test set.

The distribution of $V_n^{tr}$ characterizes all the cross-validation procedures described in the previous
section (see e.g. chapter 1). Using our notations, we can now define the cross-validation estimator.

Definition 1 (Cross-validation estimator) With the previous notations, the generalized
cross-validation error of $\psi_n$, denoted $R_{CV}(\psi_n)$, is defined by

$$R_{CV}(\psi_n) := E_{V_n^{tr}}\, P_{n,V_n^{ts}}(\psi_{V_n^{tr}}).$$
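Definition 1 can be read as a short computation: average, over the training vectors, the mean test loss of the predictor fitted on the training part. The sketch below is only an illustration under stated assumptions: training vectors are given as explicit 0/1 lists with equal probability, and the toy learning algorithm `fit_mean`, the squared loss and the helper names are hypothetical choices, not objects defined in the paper.

```python
def cv_estimate(data, train_vectors, fit, loss):
    """Generalized cross-validation error for equally likely training vectors.

    Each training vector is a 0/1 list of length n; the test vector is its complement.
    """
    total = 0.0
    for v_tr in train_vectors:
        train = [z for z, keep in zip(data, v_tr) if keep == 1]
        test = [z for z, keep in zip(data, v_tr) if keep == 0]
        predictor = fit(train)
        # Empirical mean of the loss over the test subsample.
        total += sum(loss(y, predictor(x)) for x, y in test) / len(test)
    return total / len(train_vectors)

def leave_one_out_vectors(n):
    """Training vectors with a single zero, one vector per left-out index."""
    return [[0 if i == j else 1 for i in range(n)] for j in range(n)]

def k_fold_vectors(n, k):
    """Fold j is the test set: indices i with i % k == j."""
    return [[0 if i % k == j else 1 for i in range(n)] for j in range(k)]

def fit_mean(train):
    """Toy learning algorithm: always predict the mean of the training outputs."""
    mean = sum(y for _, y in train) / len(train)
    return lambda x: mean

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

data = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
loo = cv_estimate(data, leave_one_out_vectors(4), fit_mean, squared_loss)
two_fold = cv_estimate(data, k_fold_vectors(4, 2), fit_mean, squared_loss)
print(loo, two_fold)
```

Changing only the list of training vectors switches between leave-one-out, k-fold and hold-out, which is the point of describing a procedure through the distribution of the training vector.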

We give here an example of a distribution of $V_n^{tr}$ to illustrate how we retrieve the cross-validation
procedures described previously. Leave-v-out cross-validation is an elaborate and expensive version of
cross-validation. This procedure divides the data into two sets: the training set of size n − v and the
test set of size v. It then produces a predictor by training on the training set and testing on the
remaining test set. This is repeated for all possible subsamples of v cases, and the observed errors are
averaged to form the leave-v-out estimate. Denote by $(v_{n,k}^{tr})_{1 \le k \le \binom{n}{v}}$ the family
of binary vectors of size n such that $\sum_{i=1}^{n} v_{n,k,i}^{tr} = n - v$.

Example 2 (Leave-v-out cross-validation)

$$\Pr(V_n^{tr} = v_{n,k}^{tr}) = \frac{1}{\binom{n}{v}}.$$

For other examples, see chapter one. 
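As an illustration of the leave-v-out family of training vectors, the sketch below enumerates all binary vectors of size n with exactly n − v ones. It assumes each admissible vector is drawn with equal probability, which is how the averaging over all subsamples described above reads; the function name is hypothetical.

```python
from itertools import combinations
from math import comb

def leave_v_out_vectors(n, v):
    """All binary training vectors of size n with exactly n - v ones."""
    vectors = []
    for test_idx in combinations(range(n), v):
        held_out = set(test_idx)
        vectors.append([0 if i in held_out else 1 for i in range(n)])
    return vectors

vectors = leave_v_out_vectors(5, 2)
# Assumed uniform distribution over the admissible training vectors.
prob = 1 / len(vectors)
print(len(vectors), prob)
```

The number of vectors grows as the binomial coefficient of n and v, which is why the procedure is described as expensive.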
2.2 Definitions and notations of stability 

The basic idea is that an algorithm is stable at a training set $\mathcal{D}_n$ if changing one point in
$\mathcal{D}_n$ yields only a small change in the output hypothesis. Formally, a learning algorithm maps a
weighted training set into a predictor space. Thus, stability can be translated into a Lipschitz
condition for this mapping with high probability.

To be more formal, following [KUNIY02], we define a distance between two weighted empirical errors.

Let $P_{n,V_n}$ and $P_{n,U_n}$ be two empirical measures on $\mathcal{Z}$ with respect to the binary
vectors $V_n$ and $U_n$. We do not assume their supports to be equal. The distance between them is defined
as their total variation, i.e. the number of points they do not have in common

$$\|P_{n,U_n} - P_{n,V_n}\| = \sup_{A \in \mathcal{P}(\mathcal{Z})} |(P_{n,U_n} - P_{n,V_n})(A)|.$$
Aev(Z) 
Example 3 In the case of leave-one-out (i.e. $\sum_{i=1}^{n} U_{n,i} = n - 1$), we have

$$\|P_{n,U_n} - P_n\| = \frac{2}{n}.$$

In the case of leave-v-out, we get

$$\|P_{n,U_n} - P_n\| = \frac{2v}{n}.$$

In the general setting, it follows that

$$\|P_{n,U_n} - P_n\| = 2p_n.$$
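The general-setting value $2p_n$ can be checked on a small case. The sketch below is written under two assumptions: the sample points are distinct, and the distance is evaluated as the sum of absolute differences between point masses, which is the convention under which the value $2p_n$ arises (with a supremum over sets, the same computation would give half of it). The function name is hypothetical.

```python
def total_variation_l1(u):
    """Sum over sample points of |P_{n,U_n}({Z_i}) - P_n({Z_i})| for a 0/1 vector u."""
    n = len(u)
    kept = sum(u)
    # A kept point has mass 1/kept under P_{n,U_n}; every point has mass 1/n under P_n.
    return sum(abs(u_i / kept - 1 / n) for u_i in u)

n, v = 10, 3
u = [1] * (n - v) + [0] * v   # training vector leaving out v points
tv = total_variation_l1(u)
print(tv, 2 * v / n)
```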

Finally, we need a distance d on the set $\mathcal{F}$. Let us quote three important examples. Let
$\psi_1, \psi_2 \in \mathcal{F}$. The uniform distance is defined by
$d_\infty(\psi_1, \psi_2) = \sup_{Z \in \mathcal{Z}} |\psi_1(Z) - \psi_2(Z)|$, the $L_1$-distance
by $d_1(\psi_1, \psi_2) = P|\psi_1 - \psi_2|$, and the error-distance by
$d_e(\psi_1, \psi_2) = |P(\psi_1 - \psi_2)|$. It is important to notice that what matters here is not an
absolute distance between the original class of predictors $\mathcal{G}$ seen as functions but the
distance with respect to the loss and/or the distribution P. In particular, for the $L_1$-distance, we do
not care about the behavior of the original predictors $\phi_1$ and $\phi_2$ outside the support of P. At
last, notice that we always have $d_e \le d_1 \le d_\infty$.
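The ordering of the three distances can be illustrated on a finite sample. This is a minimal sketch, assuming P is replaced by the uniform distribution on a finite set of points, so that the supremum is taken over that set only; the two functions `psi1` and `psi2` are arbitrary illustrative choices.

```python
def distances(psi1, psi2, sample):
    """Finite-sample versions of d_e, d_1 and d_inf, with P uniform on `sample`."""
    diffs = [psi1(z) - psi2(z) for z in sample]
    d_e = abs(sum(diffs)) / len(diffs)              # |P(psi1 - psi2)|
    d_1 = sum(abs(d) for d in diffs) / len(diffs)   # P|psi1 - psi2|
    d_inf = max(abs(d) for d in diffs)              # sup |psi1 - psi2| over the sample
    return d_e, d_1, d_inf

sample = [i / 10 for i in range(11)]
psi1 = lambda z: z
psi2 = lambda z: 1 - z
d_e, d_1, d_inf = distances(psi1, psi2, sample)
print(d_e, d_1, d_inf)
```

Here the two functions have the same mean, so the error-distance is essentially zero although they differ at almost every point, which shows how much weaker $d_e$ is than $d_1$ and $d_\infty$.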

We are now in a position to define the different notions of stability of a learning algorithm which cover
notions introduced by [KUNIY02]. We begin with the notion of weak stability. In essence, it says
that for any given resampling vectors, the distance between two predictors is controlled with high
probability by the distance between the resampling vectors.

Definition 4 (Weak stability) Let $\alpha, \lambda, (\delta_{n,p_n})_{n,p_n}$ be nonnegative real numbers.
A learning algorithm $\Psi$ is said to be weak $(\lambda, (\delta_{n,p_n})_{n,p_n}, d)$ stable if for any
training vector $U_n$ whose sum is equal to $n(1-p_n)$:

$$\Pr\left(d(\psi_{U_n}, \psi_n) \ge \lambda\,\|P_{n,U_n} - P_n\|^{\alpha}\right) \le \delta_{n,p_n}.$$

Notice that in the former definition Pr stands for $P^{\otimes n}$. Indeed, $\psi_n$ is trained with n
observations, drawn independently from P. A stronger notion is to consider $\psi_n$ trained with n − 1
observations drawn independently from P and an additional general observation z. We consider the stronger
notion of strong stability. As a motivation, notice that algorithms such as Empirical Risk Minimization
with finite VC dimension ([KUNIY02]) satisfy this property.

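Definition 5 (Strong stability) Let $z \in \mathcal{Z}$. Let $\mathcal{D}_n = \mathcal{D}_{n-1} \cup \{z\}$
be a learning set. Let $\lambda, (\delta_{n,p_n})_{n,p_n}$ be nonnegative real numbers. A learning
algorithm $\Psi$ is said to be strong $(\lambda, (\delta_{n,p_n})_{n,p_n}, d)$ stable if for any training
vector $U_n$ whose sum is equal to $n(1-p_n)$:

$$\Pr\left(d(\psi_{U_n}, \psi_n) \ge \lambda\,\|P_{n,U_n} - P_n\|^{\alpha}\right) \le \delta_{n,p_n}.$$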

What we have in mind for classical algorithms is $\delta_{n,p_n} = O_n(p_n \exp(-n(1-p_n)))$. We can state
the last definition in other words. Let $V_n^{tr}$ be a training vector with distribution Q such that the
number of elements in the training set is constant and equal to $n(1-p_n)$. Notice then that the former
definition also implies that
$\sup_{U_n \in \mathrm{support}(Q)} \Pr\left(\frac{d(\psi_{U_n}, \psi_n)}{\|P_{n,U_n} - P_n\|^{\alpha}} \ge \lambda\right) \le \delta_{n,p_n}$,
where support(Q) stands for the support of Q. The previous notion holds for any $U_n$ in the support of Q.
A stronger hypothesis would be that the previous probability bound holds uniformly over $U_n$ in
support(Q). This leads formally to the notion of cross-validation stability. As a motivation, notice that
algorithms such as the Lasso ([BTW07]) satisfy this property. To be more accurate, we define

Definition 6 (Cross-validation weak stability) Let $\mathcal{D}_n = (Z_i)_{1 \le i \le n}$ be a learning
set. Let $V_n^{tr}$ be a training vector with distribution Q. Let $\lambda, (\delta_{n,p_n})_{n,p_n}$ be
nonnegative real numbers. A learning algorithm $\Psi$ is said to be weak
$(\lambda, (\delta_{n,p_n})_{n,p_n}, d, Q)$ stable if it is weak $(\lambda, (\delta_{n,p_n})_{n,p_n}, d)$
stable and if:

$$\Pr\left(\sup_{U_n \in \mathrm{support}(Q)} \frac{d(\psi_{U_n}, \psi_n)}{\|P_{n,U_n} - P_n\|^{\alpha}} \ge \lambda\right) \le \delta_{n,p_n}.$$
As before, we also define the following stronger notion

Definition 7 (Cross-validation strong stability) Let $z \in \mathcal{Z}$. Let
$\mathcal{D}_n = \mathcal{D}_{n-1} \cup \{z\}$ be a learning set. Let $V_n^{tr}$ be a cross-validation
vector with distribution Q. A learning algorithm is said to be strongly
$(\lambda, (\delta_{n,p_n})_{n,p_n}, d, Q)$ stable if it is strong $(\lambda, (\delta_{n,p_n})_{n,p_n}, d)$
stable and if:

$$\Pr\left(\sup_{U_n \in \mathrm{support}(Q)} \frac{d(\psi_{U_n}, \psi_n)}{\|P_{n,U_n} - P_n\|^{\alpha}} \ge \lambda\right) \le \delta_{n,p_n}.$$

Remark 8 If the cardinality of the support of Q is denoted $\kappa(n)$, then a learning algorithm which is
weak $(\lambda, (\delta_{n,p_n})_{n,p_n}, d, Q)$-stable is also strong
$(\lambda, (\kappa(n)\delta_{n,p_n})_n, d, Q)$-stable.

At last, we consider the special important case when $\delta_{n,p_n} = 0$. This is the case in particular
for regularization networks ([BE01]).

Definition 9 (Sure stability) Notice that when $\delta_{n,p_n} = 0$, the two notions coincide and are
called sure stability.

As an example of strong stability, we develop the description of [FRE95], who introduced the algorithm.

Example 10 (Adaboost) We give an initial distribution $p^{(1)}$ and let $w^{(1)} = p^{(1)}$ and $Z_1 = 1$.
Let $\Phi$ be a learning algorithm. Let T be the number of rounds.
For each t = 1...T:

1. Train the learning algorithm $\Phi$ on the learning set with distribution $p^{(t)}$. The predictor
obtained is denoted by $\phi^{(t)}$.

2. For each i, let $a_i^{t} = |\phi^{(t)}(x_i) - y_i|$, the error of $\phi^{(t)}$ on instance i.

3. Let $\varepsilon_t = \sum_{i=1}^{n} p_i^{(t)} a_i^{t}$, the error rate of $\phi^{(t)}$ with respect to $p^{(t)}$.

4. Let $\beta_t = \frac{\varepsilon_t}{1 - \varepsilon_t}$ and let $\alpha_t = \ln(1/\beta_t)$.

5. Reweight the data: for all i, let $w_i^{(t+1)} = w_i^{(t)} \beta_t^{1 - a_i^{t}}$.

6. Normalize the distribution: let $Z_{t+1} = \sum_{i=1}^{n} w_i^{(t+1)}$ and $p_i^{(t+1)} = w_i^{(t+1)} / Z_{t+1}$.

The final output is $H_T(x) = \sum_{t=1}^{T} \alpha_t \phi^{(t)}(x)$.
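The rounds above can be sketched in a few lines. This is only an illustrative reading of the example, under several assumptions that are not in the text: labels in {0, 1}, one-dimensional inputs, a threshold rule (`fit_stump`, a hypothetical weak learner) as the learning algorithm, initial weights equal to the initial uniform distribution, and an early stop when the weighted error leaves the open interval (0, 1/2).

```python
import math

def fit_stump(xs, ys, p):
    """Hypothetical weak learner: threshold rule minimising the p-weighted error."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (0, 1):
            # sign = 1 predicts 1 when x >= thr; sign = 0 predicts 1 when x < thr.
            preds = [int((x >= thr) == bool(sign)) for x in xs]
            err = sum(pi * abs(pr - y) for pi, pr, y in zip(p, preds, ys))
            if best is None or err < best[0]:
                best = (err, thr, sign)
    _, thr, sign = best
    return lambda x: int((x >= thr) == bool(sign))

def adaboost(xs, ys, rounds):
    n = len(xs)
    p = [1.0 / n] * n      # initial distribution (assumed uniform)
    w = list(p)            # assumed initial weights, equal to the initial distribution
    predictors, alphas = [], []
    for _ in range(rounds):
        phi = fit_stump(xs, ys, p)                              # step 1
        a = [abs(phi(x) - y) for x, y in zip(xs, ys)]           # step 2
        eps = sum(pi * ai for pi, ai in zip(p, a))              # step 3
        if not 0.0 < eps < 0.5:
            break          # beta_t undefined or learner no better than chance
        beta = eps / (1.0 - eps)                                # step 4
        alphas.append(math.log(1.0 / beta))
        predictors.append(phi)
        w = [wi * beta ** (1.0 - ai) for wi, ai in zip(w, a)]   # step 5
        z = sum(w)                                              # step 6
        p = [wi / z for wi in w]
    return predictors, alphas, p

def combined_output(predictors, alphas, x):
    """Final output: alpha-weighted sum of the per-round predictors."""
    return sum(alpha * phi(x) for alpha, phi in zip(alphas, predictors))

xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, 0, 0, 1, 1]
predictors, alphas, p = adaboost(xs, ys, 3)
print(alphas, p, combined_output(predictors, alphas, 2))
```

Correctly predicted instances have their weight multiplied by $\beta_t < 1$, so the next round concentrates on the instances the current predictor gets wrong.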
[KUNIY02] shows that under certain hypotheses, Adaboost is strongly stable: suppose the learner $\Phi$ is
$(\lambda, 0, d_\infty)$ stable and other regularity assumptions hold, then Adaboost with T rounds is
strong $(\lambda^*, (\delta^*_{n,p_n})_{n,p_n}, d_\infty)$ stable for some $\lambda^*$ and
$\delta^*_{n,p_n}$.

We now give an example of a surely stable algorithm, introduced in [BE01].

Example 11 (Regularization networks) Regularization networks are attractive for their links
with Support Vector Machines and their Bayesian interpretation. This learning algorithm consists
in finding a function $\varphi : \mathbb{R}^d \to \mathbb{R}$ in a space H which minimizes the following
functional:

$$A(\varphi) := \frac{1}{n}\sum_{i=1}^{n} (y_i - \varphi(x_i))^2 + \lambda\|\varphi\|_H^2,$$

with $\|\varphi\|_H$ the $L_2$ norm in the space H. H is chosen to be a reproducing kernel Hilbert space
(RKHS) with kernel k. k is supposed to be a symmetric function
$k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. In particular, we have the following property (for
a detailed introduction to RKHS, see [ATE92])

$$|f(x)| \le \|f\|_H \|k\|_H.$$
We slightly adapt the proof in [BE01] to show that a regularization network is surely stable:

Theorem 12 If $\Psi$ is a regularization network such that $\|k\|_H \le \kappa$ and
$(y - \varphi(x))^2 \le M$, then $\Psi$ is $\frac{4M\kappa^2}{\lambda}$-surely stable with respect to the
distance $d_\infty$.

Proof 

Define A % (<p) := ^Y^jjti( Y j - f( X j)) 2 + M\<P\\h and V n '■= ' D n\{{X l ,Y i )} . ip^ is the minimizer 
of A 1 over H whereas px> n is the minimizer of A. Denote g := tp-p^ — ifD n - 

For t € [0, 1], we have A(tpv n ) — A(ipx> n + tg) is equal to 

n ( n _ i) J2( n ~ ~ yj)9{xj) - , _ (n - l)(vx> n (o;i) - yi)fl(a:i) - 2tA < y-D n ,g>H +t 2 B(g) 

with B{g) the factor oft 2 . 

In the same way, we get that A l (p> V i^) — A % (^p- D i — tg) is equal to 

, _ 1 x Yl( n ~ 1 )('Pi>i n (x j ) - yj)g{xj) + , _ 1 n Xjtfe-t 1 )) ~ Vj)g( x j) + 2tA < <fiT>i,g>H +t 2 B n (g). 

3& 37^ 



definition of A and A 1 , we have A{p Vn ) — A(<p>x> n + tg) < and A z (ip V i ) — A l (<p> V i + tg) < 0. 
Thus, we get by summing these two inequalities, dividing by K ^li) and making t — > 0. 

5^(n- l)5r 2 (^j)+5Z [(Vi?*^.) -yj)9(Xj) - {v>vS x i)~y^( x i) +n{n-\)\\g\\ 2 H < 
which leads to 

n(n - l)\\g\\ 2 H < £ \{^ n { x i) ~ vM^i) - (fv^) - ^M^)] < 2(n - l)VM K \\g\\ H 

by assumptions.
Thus, we have 

\\g\\ H < 2(n-l)VM K \\g\\ H /n\ 

and also, for all x, y 



\(<pv n (x) - y) 2 - {<Pv> n {x) -VY\< 2VM\p v Jx) ~ ¥> D * (a;)| < AM^/nX. 

□ 

Another popular example is given by the k-nearest neighbors rule, which is strongly stable with respect
to $d_1$.

Example 13 (k-nearest neighbors) In the k-nearest rule, the learning algorithm is a function of
X and of the k nearest observations to X from $(X_1, \ldots, X_n)$ and of the corresponding
$(Y_1, \ldots, Y_n)$. Because there may be ties in determining the k nearest neighbors, we use an
independent sequence $(Z, Z_1, \ldots, Z_n)$ of i.i.d. uniform random variables in [0, 1]. $X_j$ is
nearer to X than $X_i$ if:

1. $\|X_j - X\| < \|X_i - X\|$, or

2. $\|X_j - X\| = \|X_i - X\|$ and $|Z_j - Z| < |Z_i - Z|$, or

3. $\|X_j - X\| = \|X_i - X\|$ and $Z_j = Z_i$ and $j < i$.
The last event does not count since it has zero probability.
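The three-step tie-breaking rule amounts to a lexicographic ordering, which can be sketched as follows. The sketch assumes one-dimensional inputs and takes the auxiliary variables as given lists; the function name and the sample values are hypothetical.

```python
def k_nearest_indices(x, z, points, aux, k):
    """Indices of the k nearest points to x under the tie-breaking rule.

    points[j] is X_j (a float here), aux[j] is the auxiliary uniform variable Z_j,
    and z is the auxiliary variable attached to the query x.
    Ordering: smaller distance first, then smaller |Z_j - Z|, then smaller index.
    """
    order = sorted(range(len(points)),
                   key=lambda j: (abs(points[j] - x), abs(aux[j] - z), j))
    return order[:k]

points = [0.0, 2.0, 2.0, 5.0]
aux = [0.9, 0.7, 0.4, 0.1]
# The first three points are all at distance 1 from the query x = 1.0,
# so the auxiliary variables decide which two are kept.
nearest = k_nearest_indices(1.0, 0.5, points, aux, 2)
print(nearest)
```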

Denote by $\gamma_d$ the maximum number of distinct points in $\mathbb{R}^d$ that share the same nearest
neighbor. It can be shown that $\gamma_d \le 3^d - 1$, and other lower and upper bounds can be found in
[ROG63]. Recall the following lemma from [DEWA79]: suppose $(X_1, Z_1), \ldots, (X_n, Z_n)$ is the
sequence obtained from the data by omitting the $Y_1, \ldots, Y_n$. If, for each j, the nearest neighbor
to $(X_j, Z_j)$ is found from
$(X_1, Z_1), \ldots, (X_{j-1}, Z_{j-1}), (X_{j+1}, Z_{j+1}), \ldots, (X_n, Z_n)$, then no point
$(X_i, Y_i)$ can be the nearest neighbor of more than $\gamma_d + 2$ of the remaining points.
We can derive the next result following the proofs in [DEWA79].

Theorem 14 Let $\mathcal{D}_n := ((X_1, Z_1, Y_1), \ldots, (x, z, y))$ be a learning set. Suppose $\Phi$
is a k-local rule. Then we have for all $\varepsilon > 0$,

— (n - IV 3 

Pi(E XiYtZ \L(Y, $((X, Z), V n )) - L(Y, $((X, Z), 2£))| > e) < 6exp( 5 ^ fa + J 2) ) 
with $\mathcal{D}_n^i := \mathcal{D}_n \setminus \{(X_i, Y_i, Z_i)\}$ and i a fixed index.

It says that the k-nearest rule satisfies the strong stability property with respect to $d_1$ and
$\|P_{n,U_n} - P_n\|^{\alpha}$ with $\alpha < 1/3$.

Proof 

Consider one local rule first.

Let m be an integer. Consider an independent identically distributed ghost sample
$((X_{n+1}, Y_{n+1}, Z_{n+1}), \ldots, (X_{n+m}, Y_{n+m}, Z_{n+m}))$.
Denote $\mathcal{T}_{n+m} := ((X_1, Y_1, Z_1), \ldots, (X_{n+m}, Y_{n+m}, Z_{n+m}))$ and
$\mathcal{T}_{n+m}^{j} := \mathcal{T}_{n+m} \setminus \{(X_j, Y_j, Z_j)\}$.

. Li := E X ^ Z \L(Y, Z),X>„)) - L(Y. Z),X>* ))| 

• L 2 := ± J2T=i \ L ( Y n +j M(X n+j ,Z n+j ),V n )) -L(Y n+3 ,<t>((X n+3 ,Z n+J ),Vl))\ 

• L 3 := i J2T=i \ L (Y n+J ,H(X n+3 ,Z n+J ),V n )) -L{Y n+J M(X n+p Z n+0 )X+l))\ 

• U := i E™ 1 \L(Y p mX n+j ,Z n+j ),Vi)) - L(Y n+p mX n+p Z n+3 ),T:+l))l 

We have

$$\Pr(L_1 \ge 3\varepsilon) \le \Pr(L_1 - L_2 \ge \varepsilon) + \Pr(L_3 \ge \varepsilon) + \Pr(L_4 \ge \varepsilon).$$

By Hoeffding's inequality we have $\Pr(L_1 - L_2 \ge \varepsilon) \le \exp(-2m\varepsilon^2)$.



Now we get for the second term 

m 

Pr(L 3 > e) < Pr(- ^ 1 «((jr f , +i ,z f , +i ),i? n )^»((JC B+ ,,z B+J ),7^M) ^ e ) 

^ m 

j=i 

wra</i A(n + j) the event that the nearest neighbor of [X n+ -, Z n+ j) from Tj^n * s o-ttained in the ghost 
sample T% + £\V n . 
From [DEWA79], we have, if $(\gamma_d + 2)m \le (n+m)\varepsilon/2$,

$$\Pr\left(\frac{1}{m}\sum_{j=1}^{m} 1_{A(n+j)} \ge \varepsilon\right) \le 2\exp(-2m(\varepsilon/2)^2).$$

In the same way, we find that $\Pr(L_4 \ge \varepsilon) \le 2\exp(-2m(\varepsilon/2)^2)$ if
$(\gamma_d + 2)m \le (n-1+m)\varepsilon/2$.
Taking m — ^ n ~^ > we obtain 

Pr(L3> £ )<2exp(^-i|!) 

and Pr(L 3 > e) < 2exp(^g^f ). 

For an arbitrary k, it is sufficient to replace $(\gamma_d + 2)$ by $k(\gamma_d + 2)$.
□ 

A last popular example is given by the Lasso, which is strongly stable with respect to $d_1$.

Example 15 (Lasso) We follow [BTW07], who define Lasso-type methods in the following way. Let
$((X_1, Y_1), \ldots, (X_n, Y_n))$ be a sample of i.i.d. pairs distributed as
$(X, Y) \in (\mathcal{X}, \mathbb{R})$, where $\mathcal{X}$ is a Borel subset of $\mathbb{R}^d$. We denote
by $\mu$ the distribution of X on $\mathcal{X}$. Let $f(X) = E(Y \mid X)$ be the unknown regression
function and $\mathcal{F}_M = \{f_1, \ldots, f_M\}$ be a dictionary of real-valued functions $f_j$ that
are defined on $\mathcal{X}$. We use a data-dependent $\ell_1$-penalty. Formally, for any
$\lambda = (\lambda_1, \ldots, \lambda_M) \in \mathbb{R}^M$, define
$f_\lambda(x) = \sum_{j=1}^{M} \lambda_j f_j(x)$. Then the penalized least squares estimator of $\lambda$ is

$$\hat{\lambda} = \arg\min_{\lambda \in \mathbb{R}^M}\left\{\frac{1}{n}\sum_{i=1}^{n}(Y_i - f_\lambda(X_i))^2 + \mathrm{pen}(\lambda)\right\}$$

where

$$\mathrm{pen}(\lambda) = 2\sum_{j=1}^{M}\omega_{n,j}|\lambda_j| \quad\text{with}\quad \omega_{n,j} = r_{n,M}\|f_j\|_n,$$

where $\|g\|_n^2 = n^{-1}\sum_{i=1}^{n} g^2(X_i)$ is the squared empirical $L_2$ norm of any function
$g : \mathcal{X} \to \mathbb{R}$. The tuning sequence $r_{n,M}$ is defined by
$r_{n,M} := A\sqrt{\log(M)/n}$ for A large enough. Then we have

$$f_n = f_{\hat{\lambda}}.$$

Define

$$M(\lambda) = \sum_{j=1}^{M} 1_{\{\lambda_j \neq 0\}},$$

the number of non-zero coordinates of $\lambda$.

We recall the definition of weak sparsity in [BTW07]. Let $C_f > 0$ be a constant depending only on f
and

$$\Lambda = \{\lambda \in \mathbb{R}^M : \|f_\lambda - f\|^2 \le C_f\, r_{n,M}^2\, M(\lambda)\}$$

where

$$\|g\|^2 = \int_{\mathcal{X}} g^2(x)\,\mu(dx).$$

If $\Lambda$ is not empty, f has the weak sparsity property relative to the dictionary
$\{f_1, \ldots, f_M\}$.
We then have the following theorem
Theorem 16 Assume the general assumptions (A1)-(A3) and consider the notations in [BTW07].
Then, for all $\lambda \in \Lambda$,

$$\Pr\left(|E_{X,Y}(Y - f_n(X))^2 - E_{X,Y}(Y - f_{n-1}(X))^2| \ge 2B_1\kappa_M^{-1} r_{n,M}^2 M(\lambda)\right) \le 2\pi_{n-1,M}(\lambda)$$

with $\pi_{n-1,M}(\lambda)$ a small probability defined in [BTW07].

In other words, the Lasso-type algorithm is weakly stable with respect to $d_e$ and $\|\cdot\|_1$.
Proof

According to Theorem 2.1 in [BTW07], we have:

$$\Pr\left(E_X|f_n(X) - f(X)|^2 \le B_1\kappa_M^{-1} r_{n,M}^2 M(\lambda)\right) \ge 1 - \pi_{n,M}(\lambda).$$

Thus, denote
$\pi := \Pr\left(|E_{X,Y}(Y - f_n(X))^2 - E_{X,Y}(Y - f_{n-1}(X))^2| \ge 2B_1\kappa_M^{-1} r_{n,M}^2 M(\lambda)\right)$.
We obtain:

$$\begin{aligned}
\pi &= \Pr\left(|E_X(f(X) - f_n(X))^2 - E_X(f(X) - f_{n-1}(X))^2| \ge 2B_1\kappa_M^{-1} r_{n,M}^2 M(\lambda)\right) \\
&\le \Pr\left(E_X(f(X) - f_n(X))^2 \ge B_1\kappa_M^{-1} r_{n,M}^2 M(\lambda)\right)
 + \Pr\left(E_X(f(X) - f_{n-1}(X))^2 \ge B_1\kappa_M^{-1} r_{n,M}^2 M(\lambda)\right) \\
&\le 2\pi_{n-1,M}(\lambda).
\end{aligned}$$

□

As seen in the following table, we retrieve with those notations the different notions of stability
introduced by [DEWA79], [KEA95] and also [BE01], [KUNIY02].

stability \ distance   $d_\infty$                                      $d_1$                                   $d_e$

Weak                   weak $(\lambda, \delta)$ hypothesis stability    weak $(\lambda, \delta)$ $L_1$ stability   weak $(\lambda, \delta)$ error stability
                       [KUNIY02]                                       [KUNIY02]                               [KUNIY02]

Strong                 strong $(\lambda, \delta)$ hypothesis stability  strong $(\lambda, \delta)$ $L_1$ stability strong $(\lambda, \delta)$ error stability
                       [KUNIY02] [DEWA79]                              [KUNIY02]                               [KUNIY02]

Sure stability         uniform stability                               [DEWA79]                                error stability
                       [BE01]                                                                                  [KEA95]

To motivate this approach, we also quote a list of classes of predictors satisfying the previous stability
conditions.

stability \ distance   $d_\infty$                 $d_1$                   $d_e$

Weak                                                                      Lasso

Strong                 Adaboost ([KUNIY02])       - ERM ([KUNIY02])       Bayesian algorithm
                                                  - k-nearest rule        [KEA95]

Uniform                Regularization networks

Remark 17 We omit other weaker definitions of stability such as those defined in [BE01], [DEWA79], and
[KUNIY02]. They consider bounds on the first moment $E_{\mathcal{D}_n} d(\psi_{U_n}, \psi_{V_n})$ instead
of probability bounds. Under these assumptions, they obtain polynomial upper bounds on
$\Pr(|R_{CV} - R_n| \ge \varepsilon)$. It would be interesting to explore the behaviour of
cross-validation estimates under these hypotheses. However, this cannot be done with the techniques
presented in this paper and is left to further investigation.

The main notations and definitions are summarized in the next table:

Name                            Notation       Definition

Risk or generalization error    $R_n$          $E_P[L(Y, \phi(X, \mathcal{D}_n)) \mid \mathcal{D}_n]$
Resubstitution error            $\hat{R}_n$    $\frac{1}{n}\sum_{i=1}^{n} L(Y_i, \phi(X_i, \mathcal{D}_n))$
Cross-validation error          $R_{CV}$       $E_{V_n^{tr}} P_{n,V_n^{ts}} \psi_{V_n^{tr}}$

Table 2: Main notations



3 Results for risk assessment for stable algorithms 

Our goal is now to derive upper bounds for the probability that the distance between the
cross-validation estimator and the generalization error is greater than $\varepsilon > 0$:
$\Pr(|R_{CV} - R_n| \ge \varepsilon)$.



3.1 Hypotheses H 

Let $\mathcal{D}_n$ be a learning set of size n. Let $V_n^{tr} \sim Q$ be a training vector independent of
$\mathcal{D}_n$ such that the cross-validation is symmetric, i.e. $\Pr(V_{n,i}^{tr} = 1)$ is a constant
independent of i, and the number of elements in the test set is equal to $np_n$. Let d be a distance among
$d_e$, $d_1$, $d_\infty$. At last, we suppose that the loss function L is bounded by 1. We derive the
following general results that hold for general cross-validation procedures and stable algorithms.



3.2 Strong stability 

We state two results according to the class of stability. We will use the definition of strong difference 
bounded introduced by KUT02 and a corollary of his main theorem inspired by : McD89 . 

Definition 18 (Kutin [KUT02]) Let $\Omega_1, \ldots, \Omega_n$ be probability spaces. Let $\Omega = \prod_{k=1}^{n} \Omega_k$ and let $X$ be
a random variable on $\Omega$. We say that $X$ is strongly difference bounded by $(b, c, \delta)$ if the following
holds: there is a "bad" subset $B \subset \Omega$, where $\delta = P(B)$. If $\omega, \omega' \in \Omega$ differ only in the $k$-th coordinate, and
$\omega \notin B$, then

$$|X(\omega) - X(\omega')| \leq c.$$

Furthermore, for any $\omega, \omega' \in \Omega$,

$$|X(\omega) - X(\omega')| \leq b.$$

We will need the following theorem. It says in substance that a strongly difference bounded function
of independent variables is close to its expectation with high probability.

Theorem 19 (Kutin [KUT02]) Let $\Omega_1, \ldots, \Omega_n$ be probability spaces. Let $\Omega = \prod_{k=1}^{n} \Omega_k$ and let $X$ be a
random variable on $\Omega$ which is strongly difference bounded by $(b, c, \delta)$. Assume $b \geq c \geq 0$.
Let $\mu = E(X)$. Then, for any $\tau > 0$, $\alpha' > 0$,

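We are now in a position to derive

Theorem 20 (Cross-validation strong stability) Suppose that $\mathcal{H}$ holds. Let $\Phi$ be a learning
algorithm which is strong $(\lambda, (\delta_{n,p_n})_{n,p_n}, d, Q)$ stable. Then, for all $\varepsilon > 0$,

$$\Pr\big(|\widehat{R}_{CV} - \widetilde{R}_n| \geq \varepsilon + \lambda(2p_n)^{\alpha}\big) \leq 2\exp(-2np_n\varepsilon^2) + \delta_{n,p_n}.$$

Furthermore, if $d$ is the uniform distance $d_\infty$, then we have for all $\varepsilon > 0$:

$$\Pr\big(|\widehat{R}_{CV} - \widetilde{R}_n| \geq \varepsilon + \delta_{n,p_n} + \lambda(2p_n)^{\alpha}\big) \leq 4\Big(\exp\Big(-\frac{\varepsilon^2}{8n(5\lambda(2p_n)^{\alpha} + \alpha')^2}\Big) + \frac{n}{\alpha'}\,\delta'_{n,p_n}\Big),$$

with $\delta'_{n,p_n} = \delta_{n,p_n} + (n+1)\,\delta_{n+1,1/(n+1)}$.

Thus, if we choose $\alpha' = 5\lambda(2p_n)^{\alpha}$,

$$\Pr\big(|\widehat{R}_{CV} - \widetilde{R}_n| \geq \varepsilon + \delta_{n,p_n} + \lambda(2p_n)^{\alpha}\big) \leq 4\Big(\exp\Big(-\frac{\varepsilon^2}{8(10\lambda)^2 n (2p_n)^{2\alpha}}\Big) + \frac{n}{5\lambda(2p_n)^{\alpha}}\,\delta'_{n,p_n}\Big).$$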
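To get a feeling for the first inequality of Theorem 20, the following minimal sketch evaluates its two sides numerically: the deviation threshold $\varepsilon + \lambda(2p_n)^{\alpha}$ and the probability bound $2\exp(-2np_n\varepsilon^2) + \delta_{n,p_n}$. The function names and the numerical values are hypothetical and serve only as an illustration.

```python
import math


def deviation_threshold(eps, lam, p_n, alpha=1.0):
    """Threshold eps + lambda * (2 * p_n) ** alpha on the left-hand side."""
    return eps + lam * (2.0 * p_n) ** alpha


def general_bound(n, p_n, eps, delta):
    """Right-hand side 2 * exp(-2 * n * p_n * eps^2) + delta, capped at 1."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * p_n * eps ** 2) + delta)


# Hypothetical values chosen only for illustration.
n, p_n, eps, lam, delta = 10_000, 0.1, 0.05, 0.5, 1e-3
threshold = deviation_threshold(eps, lam, p_n)
bound = general_bound(n, p_n, eps, delta)
print(f"Pr(|R_CV - R_n| >= {threshold:.3f}) <= {bound:.4f}")
```

The exponential term only depends on the test size $np_n$, while the stability term $\lambda(2p_n)^{\alpha}$ grows with $p_n$; this is the tradeoff discussed at the end of the section.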
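Proof

1. For the general case, denote $B$ the bad subset, i.e. $B = \big\{\sup_{U_n \in \mathrm{support}(Q)} \frac{d(\psi_{U_n}, \psi_n)}{\|P_{n,U_n} - P_n\|^{\alpha}} \geq \lambda\big\}$.
Since $\Phi$ is strong $(\lambda, (\delta_{n,p_n})_{n,p_n}, d, Q)$ stable, we have $\Pr(B) \leq \delta_{n,p_n}$. It is sufficient to split
$|\widehat{R}_{CV} - \widetilde{R}_n|$ according to a benchmark, namely $\widetilde{R}_{n(1-p_n)} := E_{V_n^{tr}} P\psi_{V_n^{tr}}$. Thus, we get

$$\Pr\big(|\widehat{R}_{CV} - \widetilde{R}_n| \geq \varepsilon + \lambda(2p_n)^{\alpha}\big) \leq \Pr\big(|\widehat{R}_{CV} - \widetilde{R}_{n(1-p_n)}| \geq \varepsilon\big) + \Pr\big(|\widetilde{R}_{n(1-p_n)} - \widetilde{R}_n| \geq \lambda(2p_n)^{\alpha}\big).$$

The first term can be bounded by the conditional Hoeffding inequality (see chapter 1). Thus, we
obtain

$$\Pr\big(|\widehat{R}_{CV} - \widetilde{R}_{n(1-p_n)}| \geq \varepsilon\big) \leq 2\exp(-2np_n\varepsilon^2).$$

For the second term, notice that:

$$|\widetilde{R}_{n(1-p_n)} - \widetilde{R}_n| = |E_{V_n^{tr}} P\psi_{V_n^{tr}} - P\psi_n| \leq E_{V_n^{tr}} |P\psi_{V_n^{tr}} - P\psi_n|.$$

Recall that for any $d \in \{d_e, d_1, d_\infty\}$, we have $|P\psi_{V_n^{tr}} - P\psi_n| \leq d(\psi_{V_n^{tr}}, \psi_n)$ and $\|P_{n,V_n^{tr}} - P_n\|^{\alpha} = (2p_n)^{\alpha}$.

Thus, since $\Phi$ is strong $(\lambda, (\delta_{n,p_n})_{n,p_n}, d, Q)$ stable, we have

$$\Pr\big(|\widetilde{R}_{n(1-p_n)} - \widetilde{R}_n| \geq \lambda(2p_n)^{\alpha}\big) \leq \Pr\Big(\sup_{V_n^{tr} \in \mathrm{support}(Q)} \frac{d(\psi_{V_n^{tr}}, \psi_n)}{\|P_{n,V_n^{tr}} - P_n\|^{\alpha}} \geq \lambda\Big) = \Pr(B) \leq \delta_{n,p_n}.$$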

2. In the particular case when $d = d_\infty$, the strongest notion of stability, we can obtain a stronger
result. For this, we recall two very useful results.

We proceed as in [BE02], [KUNIY02], using a bounded difference inequality:

• first, we show that the expectation of $\widehat{R}_{CV} - \widetilde{R}_n$ is small,

• secondly, we show that $\widehat{R}_{CV} - \widetilde{R}_n$, seen as a function $f$ of $Z_1, Z_2, \ldots, Z_n$, is strongly
difference bounded, i.e. with high probability, there exist constants $c_1, \ldots, c_n$ such that we
have for all $i$, for all $z \in \mathcal{Z}$,

$$|f(Z_1, \ldots, Z_i, \ldots, Z_n) - f(Z_1, \ldots, Z_{i-1}, z, Z_{i+1}, \ldots, Z_n)| \leq c_i,$$

• then, we use Theorem 19 with the first two points,

• finally, we use symmetry arguments to conclude.
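1. The expectation of $\widehat{R}_{CV} - \widetilde{R}_n$ is small

Let us denote $v_n^{tr}, v_n^{ts}$ fixed training and test vectors.

$$P^{\otimes n}(\widehat{R}_{CV} - \widetilde{R}_n) = P^{\otimes n}\big(E_{V_n^{tr}} P_{n,V_n^{ts}}\psi_{V_n^{tr}} - P\psi_n\big) = P^{\otimes n} P(\psi_{v_n^{tr}} - \psi_n)$$

since

$$P^{\otimes n} E_{V_n^{tr}} P_{n,V_n^{ts}}\psi_{V_n^{tr}} = E_{V_n^{tr}} P^{\otimes n} P_{n,V_n^{ts}}\psi_{V_n^{tr}} = E_{V_n^{tr}} P^{\otimes n} P\psi_{V_n^{tr}} = P^{\otimes n} P\psi_{v_n^{tr}},$$

where the first equality comes from the linearity of expectation, the second from the fact that
$P_{n,V_n^{tr}}$ is independent of $P_{n,V_n^{ts}}$, and the third from the i.i.d. nature of $(Z_i)_i$.

Recall that $P(\psi_{v_n^{tr}} - \psi_n) \leq d(\psi_{v_n^{tr}}, \psi_n)$, where $d$ stands indifferently for $d_1$, $d_e$ or $d_\infty$. Thus,
$P^{\otimes n} P(\psi_{v_n^{tr}} - \psi_n) \leq P^{\otimes n} d(\psi_{v_n^{tr}}, \psi_n)$. By conditioning according to the small values of $d(\psi_{v_n^{tr}}, \psi_n)$, we
obtain

$$P^{\otimes n} d(\psi_{v_n^{tr}}, \psi_n) = P^{\otimes n}\big(d(\psi_{v_n^{tr}}, \psi_n) \mid B\big) P^{\otimes n}(B) + P^{\otimes n}\big(d(\psi_{v_n^{tr}}, \psi_n) \mid B^c\big)\big(1 - P^{\otimes n}(B)\big)$$

$$\leq 1 \times \delta_{n,p_n} + \lambda P^{\otimes n}\|P_{n,v_n^{tr}} - P_n\|^{\alpha} \times (1 - \delta_{n,p_n}) = \delta_{n,p_n} + \lambda(2p_n)^{\alpha}(1 - \delta_{n,p_n}).$$

Eventually, we get $P^{\otimes n}(\widehat{R}_{CV} - \widetilde{R}_n) \leq \delta_{n,p_n} + \lambda(2p_n)^{\alpha}$.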

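2. $\widehat{R}_{CV} - \widetilde{R}_n$ is difference bounded with high probability

Denote $f(Z_1, Z_2, \ldots, Z_n) := \widehat{R}_{CV} - \widetilde{R}_n$. Let $z \in \mathcal{Z}$. Let $\mathcal{D}_{n+1} = \mathcal{D}_n \cup \{z\}$. Now denote
$B = B_1 \cup B_2$ where

$$B_1 = \Big\{\sup_{U_n \in \mathrm{support}(Q)} \frac{d(\psi_{U_n}, \psi_n)}{\|P_{n,U_n} - P_n\|^{\alpha}} \geq \lambda\Big\}$$

and

$$B_2 = \Big\{\sup_{1 \leq i \leq n+1} \frac{d(\psi_{e^i_{n+1}}, \psi_{n+1})}{\|P_{n+1,e^i_{n+1}} - P_{n+1}\|^{\alpha}} \geq \lambda\Big\}$$

with $e^i_{n+1}$ the binary vector of size $n+1$ equal to 1 everywhere except on the $i$-th coordinate: $e^i_{n+1,k} := 1(k \neq i)$
for $1 \leq k \leq n+1$. Under our assumptions, we have

$$\Pr(B) \leq \delta_{n,p_n} + (n+1)\,\delta_{n+1,1/(n+1)}.$$

We want to show that with high probability there exist constants $c_i$ such that for all $i \in \{1, \ldots, n\}$,
for all $z \in \mathcal{Z}$, $|f(Z_1, \ldots, Z_i, \ldots, Z_n) - f(Z_1, \ldots, Z_{i-1}, z, Z_{i+1}, \ldots, Z_n)| \leq c_i$.

Notice that

$$|f(Z_1, \ldots, Z_i, \ldots, Z_n) - f(Z_1, \ldots, z, \ldots, Z_n)| = \big|\big(E_{V_n^{tr}} P_{n,V_n^{ts}}\psi_{V_n^{tr}} - P\psi_{e^{n+1}_{n+1}}\big) - \big(E_{V_n^{tr}} P'_{n,V_n^{ts}}\psi'_{V_n^{tr}} - P\psi_{e^{i}_{n+1}}\big)\big|$$

$$\leq \big|E_{V_n^{tr}} P_{n,V_n^{ts}}\psi_{V_n^{tr}} - E_{V_n^{tr}} P'_{n,V_n^{ts}}\psi'_{V_n^{tr}}\big| + \big|P\psi_{e^{n+1}_{n+1}} - P\psi_{e^{i}_{n+1}}\big|.$$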
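with $P'_{n,V_n^{tr}}$ the weighted empirical measure on the sample

$$\mathcal{D}'_n = \{Z_1, \ldots, Z_{i-1}, z, Z_{i+1}, \ldots, Z_n\}$$

and $\psi'_{V_n^{tr}}$ the predictor trained on $\mathcal{D}'_{V_n^{tr}}$.

So, first, let us bound the second term. Recall that

$$\big|P(\psi_{e^{n+1}_{n+1}} - \psi_{e^{i}_{n+1}})\big| \leq d(\psi_{e^{n+1}_{n+1}}, \psi_{e^{i}_{n+1}}) \leq d(\psi_{e^{n+1}_{n+1}}, \psi_{n+1}) + d(\psi_{n+1}, \psi_{e^{i}_{n+1}})$$

with $\psi_{n+1}$ trained on $\mathcal{D}_{n+1} = \{Z_1, \ldots, Z_{i-1}, Z_i, Z_{i+1}, \ldots, Z_n, z\}$. Thus, we have on $B^c$,

$$\big|P\psi_{e^{n+1}_{n+1}} - P\psi_{e^{i}_{n+1}}\big| \leq 2\lambda\Big(\frac{2}{n+1}\Big)^{\alpha}.$$

To upper bound the first term, notice that

$$\big|E_{V_n^{tr}} P_{n,V_n^{ts}}\psi_{V_n^{tr}} - E_{V_n^{tr}} P'_{n,V_n^{ts}}\psi'_{V_n^{tr}}\big| = \big|E_{V_n^{tr}}\big(P_{n,V_n^{ts}}(\psi_{V_n^{tr}} - \psi'_{V_n^{tr}}) \mid V_{n,i}^{tr} = 1\big) \times (1 - p_n)$$

$$+\; E_{V_n^{tr}}\big((P_{n,V_n^{ts}} - P'_{n,V_n^{ts}})\psi_{V_n^{tr}} \mid V_{n,i}^{ts} = 1\big) \times p_n\big|.$$

We always have for any $\psi$, $|(P_{n,V_n^{ts}} - P'_{n,V_n^{ts}})\psi| \leq 1/(np_n)$, thus

$$\big|E_{V_n^{tr}}\big((P_{n,V_n^{ts}} - P'_{n,V_n^{ts}})\psi_{V_n^{tr}} \mid V_{n,i}^{ts} = 1\big) \times p_n\big| \leq 1/n.$$

Until now, the previous lines hold independently of $d \in \{d_e, d_1, d_\infty\}$. We still have to bound
$|E_{V_n^{tr}}(P_{n,V_n^{ts}}(\psi_{V_n^{tr}} - \psi'_{V_n^{tr}}) \mid V_{n,i}^{tr} = 1)|$. In the particular case of the strongest kind of stability
(i.e. when $d = d_\infty$), we have

$$\big|E_{V_n^{tr}}\big(P_{n,V_n^{ts}}(\psi_{V_n^{tr}} - \psi'_{V_n^{tr}}) \mid V_{n,i}^{tr} = 1\big)\big| \leq E_{V_n^{tr}}\big(d_\infty(\psi_{V_n^{tr}}, \psi'_{V_n^{tr}}) \mid V_{n,i}^{tr} = 1\big).$$

On $B^c$, we get $d_\infty(\psi_{V_n^{tr}}, \psi'_{V_n^{tr}}) \leq d_\infty(\psi_{V_n^{tr}}, \psi_{n+1}) + d_\infty(\psi_{n+1}, \psi'_{V_n^{tr}}) \leq 2\lambda(2p_n)^{\alpha}$.
Thus, on $B^c$, we have

$$E_{V_n^{tr}}\big(d_\infty(\psi_{V_n^{tr}}, \psi'_{V_n^{tr}}) \mid V_{n,i}^{tr} = 1\big) \leq 2\lambda(2p_n)^{\alpha}.$$

Putting all together, with probability at least $1 - \delta'_{n,p_n}$,

$$\sup_{1 \leq i \leq n,\; z \in \mathcal{Z}} |f(Z_1, \ldots, Z_i, \ldots, Z_n) - f(Z_1, \ldots, z, \ldots, Z_n)| \leq 5\lambda(2p_n)^{\alpha}.$$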
l<i<n,z£Z 

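3. $\widehat{R}_{CV} - \widetilde{R}_n$ is close to zero with high probability

Applying Theorem 19, we obtain that for all $\varepsilon > 0$,

$$\Pr\big(\widehat{R}_{CV} - \widetilde{R}_n \geq \varepsilon + \delta_{n,p_n} + \lambda(2p_n)^{\alpha}\big) \leq \Pr\big(\widehat{R}_{CV} - \widetilde{R}_n - E_{\mathcal{D}_n}(\widehat{R}_{CV} - \widetilde{R}_n) \geq \varepsilon\big)$$

$$\leq 2\Big(\exp\Big(-\frac{\varepsilon^2}{8n(5\lambda(2p_n)^{\alpha} + \alpha')^2}\Big) + \frac{n}{\alpha'}\,\delta'_{n,p_n}\Big) = 2\Big(\exp\Big(-\frac{\varepsilon^2}{8(10\lambda)^2 n (2p_n)^{2\alpha}}\Big) + \frac{n}{5\lambda(2p_n)^{\alpha}}\,\delta'_{n,p_n}\Big)$$

by taking $\alpha' = 5\lambda(2p_n)^{\alpha}$.

By symmetry, we also have

$$\Pr\big(\widehat{R}_{CV} - \widetilde{R}_n \leq -(\varepsilon + \delta_{n,p_n} + \lambda(2p_n)^{\alpha})\big) \leq 2\Big(\exp\Big(-\frac{\varepsilon^2}{8(10\lambda)^2 n (2p_n)^{2\alpha}}\Big) + \frac{n}{5\lambda(2p_n)^{\alpha}}\,\delta'_{n,p_n}\Big),$$

which allows us to conclude.

□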



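Theorem 21 (Strong stability) Suppose that $\mathcal{H}$ holds. Let $\Phi$ be a learning algorithm which is strong
$(\lambda, (\delta_{n,p_n})_{n,p_n}, d)$ stable. Then, for all $\varepsilon > 0$, we get

$$\Pr\big(|\widehat{R}_{CV} - \widetilde{R}_n| \geq \varepsilon + \lambda(2p_n)^{\alpha}\big) \leq 2\exp(-2np_n\varepsilon^2) + \kappa(n)\,\delta_{n,p_n},$$

where $\kappa(n)$ is the number of training vectors in the cross-validation.

Furthermore, if the distance $d$ is the uniform distance $d_\infty$, then we have for any $\varepsilon > 0$:

$$\Pr\big(|\widehat{R}_{CV} - \widetilde{R}_n| \geq \varepsilon + \delta_{n,p_n} + \lambda(2p_n)^{\alpha}\big) \leq 4\Big(\exp\Big(-\frac{\varepsilon^2}{8n(5\lambda(2p_n)^{\alpha} + \alpha')^2}\Big) + \frac{n}{\alpha'}\,\kappa(n)\,\delta'_{n,p_n}\Big),$$

with $\delta'_{n,p_n} = \delta_{n,p_n} + (n+1)\,\delta_{n+1,1/(n+1)}$. Thus, if we take $\alpha' = 5\lambda(2p_n)^{\alpha}$, we get

$$\Pr\big(|\widehat{R}_{CV} - \widetilde{R}_n| \geq \varepsilon + \delta_{n,p_n} + \lambda(2p_n)^{\alpha}\big) \leq 4\Big(\exp\Big(-\frac{\varepsilon^2}{8(10\lambda)^2 n (2p_n)^{2\alpha}}\Big) + \frac{n}{5\lambda(2p_n)^{\alpha}}\,\kappa(n)\,\delta'_{n,p_n}\Big).$$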

Proof 

For the first inequality, it is sufficient to use remarks [8] 

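For the second one, we can follow the previous proof, using remarks [8] and noticing that if we denote
$B_{v_n^{tr}} := \{d(\psi_{v_n^{tr}}, \psi_n) \geq \lambda\|P_{n,v_n^{tr}} - P_n\|^{\alpha}\}$, then, we have

$$P^{\otimes n} d(\psi_{v_n^{tr}}, \psi_n) \leq 1 \times \delta_{n,p_n} + \lambda P^{\otimes n}\|P_{n,v_n^{tr}} - P_n\|^{\alpha} \times (1 - \delta_{n,p_n}) = \delta_{n,p_n} + \lambda(2p_n)^{\alpha}(1 - \delta_{n,p_n}).$$

Eventually, we get $P^{\otimes n}(\widehat{R}_{CV} - \widetilde{R}_n) \leq \delta_{n,p_n} + \lambda(2p_n)^{\alpha}$.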
□ 

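Now, we derive results for the hold-out cross-validation, which does not make a symmetrical use of the
dataset. We obtain

Theorem 22 (Strong stability and hold-out) Let $\Phi$ be a learning algorithm which is strong
$(\lambda, (\delta_{n,p_n})_{n,p_n}, d)$ stable. Then the hold-out (or split sample) cross-validation satisfies for all $\varepsilon > 0$,

$$\Pr\big(|\widehat{R}_{CV} - \widetilde{R}_n| \geq \varepsilon + \lambda(2p_n)^{\alpha}\big) \leq 2\exp(-2np_n\varepsilon^2) + \delta_{n,p_n}.$$

Furthermore, if the distance is the uniform distance $d_\infty$, then we have

$$\Pr\big(|\widehat{R}_{CV} - \widetilde{R}_n| \geq \varepsilon + \delta_{n,p_n} + \lambda(2p_n)^{\alpha}\big) \leq 4\Big(\exp\Big(-\frac{\varepsilon^2}{8n(4\lambda(2p_n)^{\alpha} + 1/(np_n))^2}\Big) + \frac{n}{4\lambda(2p_n)^{\alpha} + 1/(np_n)}\,\delta'_{n,p_n}\Big),$$

with $\delta'_{n,p_n} = \delta_{n,p_n} + n\,\delta_{n,1/n}$.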
Proof

For the first inequality, it is enough to use remarks [5J 

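For the second one, we start as previously. First, we bound the expectation in the same way.
Secondly, we show that $\widehat{R}_{CV} - \widetilde{R}_n$ is difference bounded with high probability.

Denote $f(Z_1, Z_2, \ldots, Z_n) := \widehat{R}_{CV} - \widetilde{R}_n$. Let $z \in \mathcal{Z}$. Now denote as previously $B := B_1 \cup B_2$ with

$$B_1 = \Big\{\frac{d(\psi_{v_n^{tr}}, \psi_n)}{\|P_{n,v_n^{tr}} - P_n\|^{\alpha}} \geq \lambda\Big\} \quad \text{and} \quad B_2 = \Big\{\sup_{i} \frac{d(\psi_{e^i_{n+1}}, \psi_{n+1})}{\|P_{n+1,e^i_{n+1}} - P_{n+1}\|^{\alpha}} \geq \lambda\Big\}.$$

Eventually, we have $\Pr(B) \leq \delta_{n,p_n} + n\,\delta_{n,1/n}$.

We want to show that with high probability there exist constants $c_i$ such that for all $i$, for all $z \in \mathcal{Z}$,
$|f(Z_1, \ldots, Z_i, \ldots, Z_n) - f(Z_1, \ldots, z, \ldots, Z_n)| \leq c_i$. Since $V_n^{tr} = v_n^{tr}$ is a fixed vector in the case of hold-out,
notice that:

$$|f(Z_1, \ldots, Z_i, \ldots, Z_n) - f(Z_1, \ldots, z, \ldots, Z_n)| = \big|\big(P_{n,v_n^{ts}}\psi_{v_n^{tr}} - P\psi_{e^{n+1}_{n+1}}\big) - \big(P'_{n,v_n^{ts}}\psi'_{v_n^{tr}} - P\psi_{e^{i}_{n+1}}\big)\big|$$

$$\leq \big|P_{n,v_n^{ts}}\psi_{v_n^{tr}} - P'_{n,v_n^{ts}}\psi'_{v_n^{tr}}\big| + \big|P\psi_{e^{n+1}_{n+1}} - P\psi_{e^{i}_{n+1}}\big|,$$

with $P'_{n,v_n^{tr}}$ the weighted empirical measure of the sample $\mathcal{D}'_n = \{Z_1, \ldots, Z_{i-1}, z, Z_{i+1}, \ldots, Z_n\}$ and
$\psi'_{v_n^{tr}}$ the predictor trained on $\mathcal{D}'_{v_n^{tr}}$.

So, first, let us bound the second term. Recall that:

$$\big|P(\psi_{e^{n+1}_{n+1}} - \psi_{e^{i}_{n+1}})\big| \leq d(\psi_{e^{n+1}_{n+1}}, \psi_{e^{i}_{n+1}}) \leq d(\psi_{e^{n+1}_{n+1}}, \psi_{n+1}) + d(\psi_{n+1}, \psi_{e^{i}_{n+1}}).$$

Thus, on $B^c$, $|P\psi_{e^{n+1}_{n+1}} - P\psi_{e^{i}_{n+1}}| \leq 2\lambda\big(\frac{2}{n+1}\big)^{\alpha}$.

To upper bound the first term, notice that:

$$\big|P_{n,v_n^{ts}}\psi_{v_n^{tr}} - P'_{n,v_n^{ts}}\psi'_{v_n^{tr}}\big| = \big|P_{n,v_n^{ts}}(\psi_{v_n^{tr}} - \psi'_{v_n^{tr}})\,1\{v_{n,i}^{tr} = 1\} + (P_{n,v_n^{ts}} - P'_{n,v_n^{ts}})\psi_{v_n^{tr}}\,1\{v_{n,i}^{ts} = 1\}\big|.$$

We always have for any $\psi$, $|(P_{n,v_n^{ts}} - P'_{n,v_n^{ts}})\psi| \leq 1/(np_n)$, thus

$$\big|(P_{n,v_n^{ts}} - P'_{n,v_n^{ts}})\psi_{v_n^{tr}}\,1\{v_{n,i}^{ts} = 1\}\big| \leq 1/(np_n).$$

We still have to bound $|P_{n,v_n^{ts}}(\psi_{v_n^{tr}} - \psi'_{v_n^{tr}})\,1\{v_{n,i}^{tr} = 1\}|$. As in the previous proof, we have when $d = d_\infty$,

$$\big|P_{n,v_n^{ts}}(\psi_{v_n^{tr}} - \psi'_{v_n^{tr}})\,1\{v_{n,i}^{tr} = 1\}\big| \leq d_\infty(\psi_{v_n^{tr}}, \psi'_{v_n^{tr}})\,1\{v_{n,i}^{tr} = 1\}.$$

On $B^c$, $d_\infty(\psi_{v_n^{tr}}, \psi'_{v_n^{tr}}) \leq d_\infty(\psi_{v_n^{tr}}, \psi_{n+1}) + d_\infty(\psi_{n+1}, \psi'_{v_n^{tr}}) \leq 2\lambda(2p_n)^{\alpha}$. Thus, on $B^c$ we get

$$d_\infty(\psi_{v_n^{tr}}, \psi'_{v_n^{tr}})\,1\{v_{n,i}^{tr} = 1\} \leq 2\lambda(2p_n)^{\alpha}.$$

Putting all together, with probability at least $1 - \delta'_{n,p_n}$,

$$\sup_{i,\; z \in \mathcal{Z}} |f(Z_1, \ldots, Z_i, \ldots, Z_n) - f(Z_1, \ldots, z, \ldots, Z_n)| \leq 2\lambda\Big(\frac{2}{n+1}\Big)^{\alpha} + \max\big((np_n)^{-1},\, 2\lambda(2p_n)^{\alpha}\big)$$

$$\leq 4\lambda(2p_n)^{\alpha} + (np_n)^{-1}.$$

To conclude, apply again Theorem 19.
□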






3.3 Weak stability 

We now derive results that hold for general cross-validation procedures and weakly stable predictors.
We recall here the interest of the notion of weak stability. For some classes of learning algorithms, the
notion of strong stability may be too demanding. That is why weak stability is introduced. As a
motivation, algorithms such as Adaboost satisfy the following definition of weak stability.

We will use the definition of weakly difference bounded random variables introduced by [KUT02] and a
corollary of his main theorem.

Definition 23 (Kutin [KUT02]) Let $\Omega_1, \ldots, \Omega_n$ be probability spaces. Let $\Omega = \prod_{k=1}^{n} \Omega_k$ and let $X$ be
a random variable on $\Omega$. We say that $X$ is weakly difference bounded by $(b, c, \delta)$ if the following holds:
for any $k$,

$$\forall^{\delta} (\omega, v) \in \Omega \times \Omega_k, \quad |X(\omega) - X(\omega')| \leq c,$$

where $\omega'_k = v$ and $\omega'_i = \omega_i$ for $i \neq k$, and the notation $\forall^{\delta} \omega, \Phi(\omega)$ means "$\Phi(\omega)$ holds for all but a
$\delta$ fraction of $\Omega$".

Furthermore, for any $\omega, \omega' \in \Omega$ differing only in one coordinate:

$$|X(\omega) - X(\omega')| \leq b.$$

We will need the following theorem. It says in substance that a weakly difference bounded function
of independent variables is close to its expectation with high probability.

Theorem 24 (Kutin [KUT02]) Let $\Omega_1, \ldots, \Omega_n$ be probability spaces. Let $\Omega = \prod_{k=1}^{n} \Omega_k$ and let $X$ be
a random variable on $\Omega$ which is weakly difference bounded by $(b, c, \delta)$. Assume $b \geq c \geq 0$ and $\alpha > 0$.
Let $\mu = E(X)$. Then, for any $\varepsilon > 0$,

Pr(|X - „| > e) < 2exp(- 1Qnc2(1 + ii _ )2 ) + — exp(^)) + 2nS^. 

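Theorem 25 (Cross-validation weak stability) Suppose that $\mathcal{H}$ holds. Let $\Phi$ be a learning
algorithm which is weak $(\lambda, (\delta_{n,p_n})_{n,p_n}, d, Q)$ stable with respect to the distance $d$. Then, for all $\varepsilon > 0$,

$$\Pr\big(|\widehat{R}_{CV} - \widetilde{R}_n| \geq \varepsilon + \lambda(2p_n)^{\alpha}\big) \leq 2\exp(-2np_n\varepsilon^2) + \delta_{n,p_n}.$$

Furthermore, if the distance is the uniform distance $d_\infty$, we have for all $\varepsilon > 0$: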



* aft* - «.i > . + + < ^-io^s^^a^?' 

o a' 1 / 2 

ZnOn.Pn f ,'1 In \\ 

+ 5M2^ eXP( 4n(5A(2 Pn )«) 2 )) + "^»" 

with $\delta'_{n,p_n} = 2\delta_{n,1/n} + \delta_{n,p_n}$.
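Proof

In the following, denote $B$ the bad subset, i.e. $B = \cup_{v_n^{tr}} B_{v_n^{tr}}$ with $B_{v_n^{tr}} = \{d(\psi_{v_n^{tr}}, \psi_n) \geq \lambda\|P_{n,v_n^{tr}} - P_n\|^{\alpha}\}$.
Since $\Phi$ is weak $(\lambda, (\delta_{n,p_n})_{n,p_n}, d, Q)$ stable, we have $P(B) \leq \delta_{n,p_n}$.

1. For the general case, it is again sufficient to split $|\widehat{R}_{CV} - \widetilde{R}_n|$ according to the same benchmark,
namely $\widetilde{R}_{n(1-p_n)} = E_{V_n^{tr}} P\psi_{V_n^{tr}}$.

Thus,

$$\Pr\big(|\widehat{R}_{CV} - \widetilde{R}_n| \geq \varepsilon + \lambda(2p_n)^{\alpha}\big) \leq \Pr\big(|\widehat{R}_{CV} - \widetilde{R}_{n(1-p_n)}| \geq \varepsilon\big) + \Pr\big(|\widetilde{R}_{n(1-p_n)} - \widetilde{R}_n| \geq \lambda(2p_n)^{\alpha}\big).$$

The first term can be bounded as previously by $2\exp(-2np_n\varepsilon^2)$.

For the second term, notice that $|\widetilde{R}_{n(1-p_n)} - \widetilde{R}_n| = |E_{V_n^{tr}} P\psi_{V_n^{tr}} - E_{V_n^{tr}} P\psi_n| \leq E_{V_n^{tr}}|P\psi_{V_n^{tr}} - P\psi_n|$.
Recall that $|P\psi_{V_n^{tr}} - P\psi_n| \leq d(\psi_{V_n^{tr}}, \psi_n)$ and $\lambda\|P_{n,V_n^{tr}} - P_n\|^{\alpha} = \lambda(2p_n)^{\alpha}$. Thus, since $\Phi$ is weak
$(\lambda, (\delta_{n,p_n})_{n,p_n}, d)$ stable, we have

$$\Pr\big(|\widetilde{R}_{n(1-p_n)} - \widetilde{R}_n| \geq \lambda(2p_n)^{\alpha}\big) \leq \Pr\big(E_{V_n^{tr}} d(\psi_{V_n^{tr}}, \psi_n) \geq \lambda(2p_n)^{\alpha}\big)$$

$$\leq \Pr\big(\cup_{v_n^{tr}}\{d(\psi_{v_n^{tr}}, \psi_n) \geq \lambda\|P_{n,v_n^{tr}} - P_n\|^{\alpha}\}\big) = \Pr\big(\cup_{v_n^{tr}} B_{v_n^{tr}}\big) \leq \kappa(n)\,\delta_{n,p_n}.$$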

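2. In the particular case when $d = d_\infty$, we can also obtain a stronger result.

We proceed in three steps as in [BE02], [KUNIY02], using a bounded difference inequality:

1. first, we show that the expectation of $\widehat{R}_{CV} - \widetilde{R}_n$ is small, of the same order as for the strong
stability.

2. secondly, we show that $\widehat{R}_{CV} - \widetilde{R}_n$, seen as a function $f$ of $Z_1, Z_2, \ldots, Z_n$, is weakly difference
bounded, i.e. there exist constants $c_1, \ldots, c_n$ such that for all $i$, if $Z_1, \ldots, Z_i, \ldots, Z_n, Z_i'$ are
i.i.d. random variables, we have with high probability

$$|f(Z_1, \ldots, Z_i, \ldots, Z_n) - f(Z_1, \ldots, Z_i', \ldots, Z_n)| \leq c_i.$$

3. finally, we use Theorem 24 with the first two points to conclude.

1. The expectation of $\widehat{R}_{CV} - \widetilde{R}_n$ is small

As previously, denote $v_n^{tr}, v_n^{ts}$ fixed vectors. We still have

$$P^{\otimes n}(\widehat{R}_{CV} - \widetilde{R}_n) = P^{\otimes n} P(\psi_{v_n^{tr}} - \psi_n)$$

since $P^{\otimes n} E_{V_n^{tr}} P_{n,V_n^{ts}}\psi_{V_n^{tr}} = E_{V_n^{tr}} P^{\otimes n} P_{n,V_n^{ts}}\psi_{V_n^{tr}} = E_{V_n^{tr}} P^{\otimes n} P\psi_{V_n^{tr}} = P^{\otimes n} P\psi_{v_n^{tr}}$, where the first
equality comes from the linearity of expectation, the second from the fact that $P_{n,V_n^{tr}}$ is
independent of $P_{n,V_n^{ts}}$, and the third one from the i.i.d. nature of $(Z_i)_i$.

Recall that $P(\psi_{v_n^{tr}} - \psi_n) \leq d(\psi_{v_n^{tr}}, \psi_n)$, where $d$ stands indifferently for $d_1$, $d_e$ or $d_\infty$. Thus,
$P^{\otimes n} P(\psi_{v_n^{tr}} - \psi_n) \leq P^{\otimes n} d(\psi_{v_n^{tr}}, \psi_n)$. By conditioning according to the small values of $d(\psi_{v_n^{tr}}, \psi_n)$,
we obtain

$$P^{\otimes n} d(\psi_{v_n^{tr}}, \psi_n) = P^{\otimes n}\big(d(\psi_{v_n^{tr}}, \psi_n) \mid B_{v_n^{tr}}\big) P^{\otimes n}(B_{v_n^{tr}}) + P^{\otimes n}\big(d(\psi_{v_n^{tr}}, \psi_n) \mid B^c_{v_n^{tr}}\big)\big(1 - P^{\otimes n}(B_{v_n^{tr}})\big)$$

$$\leq 1 \times \delta_{n,p_n} + \lambda P^{\otimes n}\|P_{n,v_n^{tr}} - P_n\|^{\alpha} \times (1 - \delta_{n,p_n}) \leq \delta_{n,p_n} + \lambda(2p_n)^{\alpha}.$$

Eventually, we still have $P^{\otimes n}(\widehat{R}_{CV} - \widetilde{R}_n) \leq \delta_{n,p_n} + \lambda(2p_n)^{\alpha}$.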
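2. $\widehat{R}_{CV} - \widetilde{R}_n$ is difference bounded with high probability

Denote $f(Z_1, Z_2, \ldots, Z_n) := \widehat{R}_{CV} - \widetilde{R}_n$.

We want to show that for all $i$, there exists a constant $c_i$ such that

$$|f(Z_1, \ldots, Z_i, \ldots, Z_n) - f(Z_1, \ldots, Z_i', \ldots, Z_n)| \leq c_i$$

with high probability, where $Z_1, \ldots, Z_i, \ldots, Z_n, Z_i'$ are i.i.d. variables. Denote

$$B_i := \big\{d(\psi_{e^i_{n+1}}, \psi_{n+1}) \geq \lambda\|P_{n+1,e^i_{n+1}} - P_{n+1}\|^{\alpha}\big\}.$$

We proceed as previously, where $(\cup_{v_n^{tr}} B_{v_n^{tr}}) \cup B_i \cup B_{n+1}$ will play the role of $B$.

$$|f(Z_1, \ldots, Z_i, \ldots, Z_n) - f(Z_1, \ldots, Z_i', \ldots, Z_n)| = \big|\big(E_{V_n^{tr}} P_{n,V_n^{ts}}\psi_{V_n^{tr}} - P\psi_n\big) - \big(E_{V_n^{tr}} P'_{n,V_n^{ts}}\psi'_{V_n^{tr}} - P\psi'_n\big)\big|$$

$$\leq \big|E_{V_n^{tr}} P_{n,V_n^{ts}}\psi_{V_n^{tr}} - E_{V_n^{tr}} P'_{n,V_n^{ts}}\psi'_{V_n^{tr}}\big| + |P\psi_n - P\psi'_n|,$$

with $P'_n$, $P'_{n,V_n^{ts}}$ the weighted empirical measures of the sample

$$\mathcal{D}'_n = \{Z_1, \ldots, Z_i', \ldots, Z_n\}$$

and $\psi'_n$ the predictor built on $\mathcal{D}'_n$.

So, first, let us bound the second term. Recall that $|P(\psi_n - \psi'_n)| \leq d(\psi_n, \psi'_n) \leq d(\psi_n, \psi_{n+1}) + d(\psi_{n+1}, \psi'_n)$,
with $\psi_{n+1}$ the predictor trained on the sample $\mathcal{D}_{n+1} = \{Z_1, \ldots, Z_i, \ldots, Z_n, Z_i'\}$.
Thus, on $B^c$, we have $|P\psi_n - P\psi'_n| \leq 2\lambda(2/n)^{\alpha}$.

To upper bound the first term, notice that

$$\big|E_{V_n^{tr}} P_{n,V_n^{ts}}\psi_{V_n^{tr}} - E_{V_n^{tr}} P'_{n,V_n^{ts}}\psi'_{V_n^{tr}}\big| = \big|E_{V_n^{tr}}\big(P_{n,V_n^{ts}}(\psi_{V_n^{tr}} - \psi'_{V_n^{tr}}) \mid V_{n,i}^{tr} = 1\big) \times (1 - p_n)$$

$$+\; E_{V_n^{tr}}\big((P_{n,V_n^{ts}} - P'_{n,V_n^{ts}})\psi_{V_n^{tr}} \mid V_{n,i}^{ts} = 1\big) \times p_n\big|.$$

We always have for all $\psi$, $|(P_{n,V_n^{ts}} - P'_{n,V_n^{ts}})\psi| \leq 1/(np_n)$, thus we get

$$\big|E_{V_n^{tr}}\big((P_{n,V_n^{ts}} - P'_{n,V_n^{ts}})\psi_{V_n^{tr}} \mid V_{n,i}^{ts} = 1\big) \times p_n\big| \leq 1/n.$$

We still have to bound

$$\big|E_{V_n^{tr}}\big(P_{n,V_n^{ts}}(\psi_{V_n^{tr}} - \psi'_{V_n^{tr}}) \mid V_{n,i}^{tr} = 1\big)\big| \leq E_{V_n^{tr}}\big(d_\infty(\psi_{V_n^{tr}}, \psi'_{V_n^{tr}}) \mid V_{n,i}^{tr} = 1\big).$$

On $B^c_{v_n^{tr}}$, $d_\infty(\psi_{v_n^{tr}}, \psi'_{v_n^{tr}}) \leq d_\infty(\psi_{v_n^{tr}}, \psi_{n+1}) + d_\infty(\psi_{n+1}, \psi'_{v_n^{tr}}) \leq 2\lambda(2p_n)^{\alpha}$.
Thus, we get $E_{V_n^{tr}}\big(d_\infty(\psi_{V_n^{tr}}, \psi'_{V_n^{tr}}) \mid V_{n,i}^{tr} = 1\big) \leq 2\lambda(2p_n)^{\alpha}$ on $(\cup_{v_n^{tr}} B_{v_n^{tr}})^c$.

Putting all together, with probability at least $1 - 2\delta_{n+1,1/(n+1)} - \delta_{n,p_n}$,

$$|f(Z_1, \ldots, Z_i, \ldots, Z_n) - f(Z_1, \ldots, Z_i', \ldots, Z_n)| \leq 5\lambda(2p_n)^{\alpha}.$$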
3. $\widehat{R}_{CV} - \widetilde{R}_n$ is close to zero with high probability

Applying Theorem 24, we obtain for all $\varepsilon > 0$:



c 2 



+ 5A(2^ eXp( 4n(5A(2 P „)«) 2 )) + ^ 

" 2(eXP( 'l0n(5A(2 P »)") 2 (l + 15rt(5A 2 ( V r) ^ } 
Znd nPn en ji/ 2 . 



5A(2p„)« " v 4n(5A(2p„)«) 



By symmetry, we also upper bound $\Pr\big(\widehat{R}_{CV} - \widetilde{R}_n \leq -(\varepsilon + \delta_{n,p_n} + \lambda(2p_n)^{\alpha})\big)$ by the same quantity.
□ 



Theorem 26 (Weak stability) Suppose that $\mathcal{H}$ holds. Let $\Phi$ be a learning algorithm which is weak
$(\lambda, (\delta_{n,p_n})_{n,p_n}, d)$ stable. Then for all $\varepsilon > 0$, we have

$$\Pr\big(|\widehat{R}_{CV} - \widetilde{R}_n| \geq \varepsilon + \lambda(2p_n)^{\alpha}\big) \leq 2\exp(-2np_n\varepsilon^2) + \kappa(n)\,\delta_{n,p_n},$$

where $\kappa(n)$ is the number of training vectors in the cross-validation.
Furthermore, if the distance is the uniform distance $d_\infty$, we have



Pr {\R CV -Rn\>e + <5 n , Pn + A(2p„) Q ) < 4(exp(- 



e 2 



10n(5A(2p„)") 2 (l+ 15n(5A - ; ( V)°) ) 2 

9 a' 1 / 2 

+ 5A(2^ eXP( 4n(5A(2p„)«) 2)) + ^ 



Proof. 

For the first inequality, it is enough to use remarks [H 

For the second, it is enough to follow the previous proofs and to notice that $\Pr(\cup_{v_n^{tr}} B_{v_n^{tr}}) \leq \kappa(n)\,\delta_{n,p_n}$.
□ 

Similar results for hold-out can be derived in the spirit of proposition [5U We can now use the previous 
probability upper bounds to derive upper bounds for the expectation of $|\widehat{R}_{CV} - \widetilde{R}_n|$.

3.4 Results for the $L_1$ norm

For the sake of simplicity, we suppose here that $\alpha = 1$. In the general case, we just consider the
weakest notion: weak $(\lambda, (\delta_{n,p_n})_{n,p_n}, d)$ stability.

Theorem 27 ($L_1$ norm of cross-validation estimate) Suppose that $\mathcal{H}$ holds. Let $\Phi$ be a learning
algorithm which is weak $(\lambda, (\delta_{n,p_n})_{n,p_n}, d)$ stable. Then, we have



^vjRcv -R n \< 2Ap„ + J— + 5 n , Pn . 

V n Pn 

Furthermore, if $\Phi$ is a learning algorithm which is strong $(\lambda, (\delta_n)_n, d_\infty, Q)$ stable, we have
EvjRcv - R n \ < 5 n ,p n + 2 APn + 51X^/np n + — — S n , 

\)Ap n 

Proof. 

These inequalities are a consequence of the previous propositions and of the following lemma (for a
proof, see e.g. [DGL96]):

Lemma 28 Let $X$ be a nonnegative random variable. Let $K$, $C$ be nonnegative real numbers such that $C \geq 1$.
Suppose that for all $\varepsilon > 0$, $P(X \geq \varepsilon) \leq C\exp(-K\varepsilon^2)$. Then:

$$EX \leq \sqrt{\frac{\ln(Ce)}{K}}.$$

For the second one, it is enough to follow the previous proofs and to notice that Pr(Uy**--Byf) < 
□ 
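Since the lemma is only cited, a short derivation may help the reader. The sketch below assumes that the conclusion of Lemma 28 reads $EX \leq \sqrt{\ln(Ce)/K}$, which is the usual form of this result. It bounds the second moment by integrating the tail and then applies Jensen's inequality. The cut-off $u$ is a free parameter introduced for the argument.

```latex
\begin{align*}
E X^2 &= \int_0^{\infty} P(X^2 > t)\,dt
       \;\leq\; u + \int_u^{\infty} C e^{-Kt}\,dt
       \;=\; u + \frac{C}{K}\,e^{-Ku}
       && \text{for any } u \geq 0, \\
E X^2 &\leq \frac{\ln C}{K} + \frac{1}{K} = \frac{\ln(Ce)}{K}
       && \text{taking } u = \frac{\ln C}{K} \geq 0 \text{ since } C \geq 1, \\
E X   &\leq \sqrt{E X^2} \;\leq\; \sqrt{\frac{\ln(Ce)}{K}}
       && \text{by Jensen's inequality.}
\end{align*}
```

For instance, with $C = 2$ and $K = 2np_n$, as in the Hoeffding term $2\exp(-2np_n\varepsilon^2)$, this gives a contribution of order $1/\sqrt{np_n}$ to the expected deviation.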

We deduce that 

Corollary 29 Suppose that $\mathcal{H}$ holds. If $\Phi$ is a learning algorithm which is weak $(\lambda, (\delta_{n,p_n})_{n,p_n}, d)$
stable, we define the splitting rule $p_n^* = (1/(\sqrt{24}\lambda))^{2/3}(1/n)^{1/3}$. Then, we have

$$E_{\mathcal{D}_n}|\widehat{R}_{CV} - \widetilde{R}_n| \leq 4(\lambda/n)^{1/3}.$$

Furthermore, if $\Phi$ is a learning algorithm which is strong $(\lambda, (\delta_n)_n, d_\infty, Q)$ stable, we use leave-one-out
cross-validation for $n$ large enough, and we have

$$E_{\mathcal{D}_n}|\widehat{R}_{CV} - \widetilde{R}_n| = O_n(\lambda/\sqrt{n}).$$

Proof. 

Recall that for a large class of learning algorithms, we have in mind that $\delta_{n,p_n} = O_n(p_n \exp(-n(1 - p_n)))$.
Thus 2Ap„ + + <5„ iPn < 4Ap„ + y^-- We can differentiate this last bound seen as a function 

of $p_n$. We obtain $p_n^* = (1/(\sqrt{24}\lambda))^{2/3}(1/n)^{1/3}$. Thus, we deduce that $E_{\mathcal{D}_n}|\widehat{R}_{CV} - \widetilde{R}_n| \leq 4(\lambda/n)^{1/3}$. If
^ is a machine learning which is strong (A, (S n ) n , doc, Q) stable, we obtain 5 ntPrl + 2Xp n + 51Xy/np n + 
9Ap n ^n,p n — 4Ap n + 51A-y/np n for n large enough since -<L„ = O n (n 3 exp(-n/2)) if p„ < 1/2. 
Thus, $p_n^* = 1/n$ for $n$ large enough and $E_{\mathcal{D}_n}|\widehat{R}_{CV} - \widetilde{R}_n| = O_n(\lambda/\sqrt{n})$.

□ 
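The splitting rule of Corollary 29 is easy to evaluate numerically. The following minimal sketch computes $p_n^*$ and the corresponding rate $4(\lambda/n)^{1/3}$. It assumes that the rule reads $p_n^* = (1/(\sqrt{24}\lambda))^{2/3} n^{-1/3}$, i.e. that $\lambda$ sits outside the square root, which is the reading consistent with a rate in $(\lambda/n)^{1/3}$. The function names are illustrative.

```python
import math


def splitting_rule(n, lam):
    """Test proportion p_n* = (1 / (sqrt(24) * lam)) ** (2/3) * n ** (-1/3).

    For small values of lam * sqrt(n) the formula can exceed 1, in which
    case it no longer defines a valid proportion.
    """
    return (1.0 / (math.sqrt(24.0) * lam)) ** (2.0 / 3.0) * n ** (-1.0 / 3.0)


def l1_rate(n, lam):
    """Upper bound 4 * (lam / n) ** (1/3) on the expected deviation."""
    return 4.0 * (lam / n) ** (1.0 / 3.0)


# Hypothetical stability constant, chosen only for illustration.
lam = 1.0
for n in (1_000, 10_000, 100_000):
    p_star = splitting_rule(n, lam)
    print(n, round(p_star, 4), round(n * p_star), round(l1_rate(n, lam), 4))
```

Under this rule the test proportion decreases like $n^{-1/3}$ while the test size $np_n^*$ still grows like $n^{2/3}$, so both the training set and the test set grow to infinity.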

We have obtained the following conclusions: 

• Cross-validation is consistent as an estimator of the generalization error of stable algorithms.

• There is a tradeoff in the choice of the proportion $p_n$ of elements in the test set:
the smaller $p_n$ is, the better the term $B(n, p_n, \varepsilon)$ is controlled, but the weaker the upper bound
on the term $V(n, p_n, \varepsilon)$.

• In the general setting, our bounds require that the sizes of the training set and the test set grow
to infinity.

• In the particular case of the strongest kind of stability (namely
the uniform stability), we obtain a stronger result: the number of elements in the test set
does not need to grow to infinity for the consistency of symmetric cross-validation procedures. But
we lose this property with the hold-out cross-validation.

• Symmetric cross-validation outperforms hold-out cross-validation for large sets.

• Finally, as far as the expectation $E_{\mathcal{D}_n}|\widehat{R}_{CV} - \widetilde{R}_n|$ is concerned, we can define a splitting rule in
the general setting.
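The difference between symmetric and hold-out procedures under uniform stability can be made concrete with the bounded-difference constants obtained in the proofs: $5\lambda(2p_n)^{\alpha}$ for symmetric cross-validation and $4\lambda(2p_n)^{\alpha} + (np_n)^{-1}$ for hold-out. The exponential terms of the bounds have a denominator proportional to $nc^2$, where $c$ is this constant. The sketch below evaluates $nc^2$ when the test set contains a single point ($p_n = 1/n$). The function names and the value of $\lambda$ are hypothetical.

```python
def symmetric_constant(p_n, lam, alpha=1.0):
    """Difference constant 5 * lam * (2 * p_n) ** alpha (symmetric case)."""
    return 5.0 * lam * (2.0 * p_n) ** alpha


def hold_out_constant(n, p_n, lam, alpha=1.0):
    """Difference constant 4 * lam * (2 * p_n) ** alpha + 1 / (n * p_n)."""
    return 4.0 * lam * (2.0 * p_n) ** alpha + 1.0 / (n * p_n)


def scaled_constants(n, lam=1.0, alpha=1.0):
    """Return (n * c_sym^2, n * c_ho^2) for a test set of one point."""
    p_n = 1.0 / n
    c_sym = symmetric_constant(p_n, lam, alpha)
    c_ho = hold_out_constant(n, p_n, lam, alpha)
    return n * c_sym ** 2, n * c_ho ** 2


rows = [(n, *scaled_constants(n)) for n in (100, 1_000, 10_000)]
for n, sym, ho in rows:
    print(n, sym, ho)
```

With $\alpha = 1$, the symmetric quantity behaves like $100\lambda^2/n$ and vanishes, whereas the hold-out quantity grows at least like $n$ because of the $(np_n)^{-1}$ term: a test set of bounded size cannot give a consistent hold-out estimate.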
4 Appendices
4.1 Inequalities

We recall three very useful results. The first one, due to [HOEF63], bounds the difference between
the empirical mean and the expected value. The second one, due to [VC71], bounds the supremum
over the class of predictors of the difference between the training error and the generalization error.
The last one is called the bounded differences inequality [McD89].

Theorem 30 (Hoeffding's inequality) Let $X_1, \ldots, X_n$ be independent random variables with $X_i \in [a_i, b_i]$. Then
for all $\varepsilon > 0$, we get

$$\Pr\Big(\sum_{i=1}^{n} X_i - E\Big(\sum_{i=1}^{n} X_i\Big) \geq n\varepsilon\Big) \leq \exp\Big(-\frac{2n^2\varepsilon^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\Big).$$
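As a sanity check, Hoeffding's bound can be compared with a simulated tail frequency. The following minimal sketch uses fair Bernoulli variables, so that $a_i = 0$ and $b_i = 1$. The sample size, the number of trials and the seed are arbitrary choices made for the illustration.

```python
import math
import random


def hoeffding_bound(eps, ranges):
    """exp(-2 * n^2 * eps^2 / sum((b_i - a_i)^2)) for X_i in [a_i, b_i]."""
    n = len(ranges)
    denom = sum((b - a) ** 2 for a, b in ranges)
    return math.exp(-2.0 * (n * eps) ** 2 / denom)


def empirical_tail(n, eps, trials, seed=0):
    """Frequency of {sum X_i - n/2 >= n * eps} for fair Bernoulli X_i."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total = sum(rng.random() < 0.5 for _ in range(n))
        if total - 0.5 * n >= n * eps:
            hits += 1
    return hits / trials


n, eps = 100, 0.1
bound = hoeffding_bound(eps, [(0.0, 1.0)] * n)
freq = empirical_tail(n, eps, trials=2_000)
print(f"empirical tail {freq:.4f} <= Hoeffding bound {bound:.4f}")
```

The bound is distribution free, so it is typically not tight: here it equals $e^{-2}$, while the simulated frequency is noticeably smaller.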

Theorem 31 (McDiarmid, [McD89]) Let $X_1, \ldots, X_n$ be independent random variables taking values
in a set $A$, and assume that $f : A^n \to \mathbb{R}$ satisfies

$$\forall i, \quad \sup_{x_1, \ldots, x_n,\; x_i'} |f(x_1, \ldots, x_n) - f(x_1, \ldots, x_i', \ldots, x_n)| \leq c_i.$$

Then for all $\varepsilon > 0$, we have

$$P\big(f(X_1, \ldots, X_n) - Ef(X_1, \ldots, X_n) \geq \varepsilon\big) \leq \exp\Big(-\frac{2\varepsilon^2}{\sum_{i=1}^{n} c_i^2}\Big).$$